[jira] [Created] (TIKA-2483) Using PackageParser in ForkParser causes NPE

2017-10-26 Thread TzeKai Lee (JIRA)
TzeKai Lee created TIKA-2483:


 Summary: Using PackageParser in ForkParser causes NPE
 Key: TIKA-2483
 URL: https://issues.apache.org/jira/browse/TIKA-2483
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.16
Reporter: TzeKai Lee


{quote}
Caused by: java.lang.NullPointerException
    at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:158)
    at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:577)
    at org.apache.tika.config.TikaConfig.getDefaultMimeTypes(TikaConfig.java:78)
    at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:242)
    at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:379)
    at org.apache.tika.parser.pkg.PackageParser.parse(PackageParser.java:165)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
{quote}

The mediaTypeRegistry handling code in PackageParser.parse() seems to cause the 
problem, because ForkParser cannot properly construct the default TikaConfig. 
Also, since TikaConfig is not serializable, there is no way to assign 
mediaTypeRegistry/bufferedMediaTypeRegistry before calling parse().



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
On collision, the precedence order defines which key takes precedence and 
_overwrites_ the other. Overwrite is but one option (you could save *all* the 
values; it’s a multi-valued key structure, so…)

Cheers,
Chris




On 10/26/17, 9:43 AM, "Nick Burch"  wrote:

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> My general approach to conflicting metadata is simply to define 
> precedence orders.
>
> For example here is one documented from OODT:
>
> https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>
> We can do similar things with Tika, e.g.,
>
> [CoreMetadata.PROPERTIES]
> [ImageParser.METADATA]
> [TikaOCR.METADATA]

What happens if two different parsers both output the same bit of metadata 
though? eg Tim's example of one giving dc:creator of Tim and the second 
giving dc:creator of Chris?


Secondly, what about the XHTML sax events stream? I think that's probably 
the harder case...

Nick





Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> My general approach to conflicting metadata is simply to define 
> precedence orders.
>
> For example here is one documented from OODT:
>
> https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
>
> We can do similar things with Tika, e.g.,
>
> [CoreMetadata.PROPERTIES]
> [ImageParser.METADATA]
> [TikaOCR.METADATA]


What happens if two different parsers both output the same bit of metadata 
though? eg Tim's example of one giving dc:creator of Tim and the second 
giving dc:creator of Chris?



Secondly, what about the XHTML sax events stream? I think that's probably 
the harder case...


Nick


Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Thanks Nick.

My general approach to conflicting metadata is simply to define precedence 
orders.

For example here is one documented from OODT:

https://cwiki.apache.org/confluence/display/OODT/Understanding+CAS-PGE+Metadata+Precendence
 

We can do similar things with Tika, e.g.,

[CoreMetadata.PROPERTIES]
[ImageParser.METADATA]
[TikaOCR.METADATA]
…

And then start with the top, and then overlay heading downwards. Make sense?

Cheers,
Chris

P.S. The metadata key/value merging principles could be configurable, but a 
default base one of
overlay according to some configured precedence order maybe in tika-config.xml 
would be a fine
start.
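
The overlay described above can be sketched in plain Java. This is only an illustration under stated assumptions: MetadataOverlay, CollisionPolicy and merge are hypothetical names, not Tika APIs; metadata is modelled as a map of multi-valued keys; and "overlay heading downwards" is read here as later layers winning on collision, which may not be the intended direction.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of precedence-ordered metadata merging; not a Tika class. */
public class MetadataOverlay {

    /** How to resolve a key that more than one layer supplies. */
    enum CollisionPolicy { OVERWRITE, KEEP_ALL }

    /**
     * Merges layers in the order given. With OVERWRITE, a later layer
     * replaces an earlier layer's values for the same key (assumption:
     * later means higher precedence). With KEEP_ALL, values accumulate
     * in layer order.
     */
    static Map<String, List<String>> merge(List<Map<String, List<String>>> layers,
                                           CollisionPolicy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> layer : layers) {
            for (Map.Entry<String, List<String>> entry : layer.entrySet()) {
                if (policy == CollisionPolicy.OVERWRITE) {
                    merged.put(entry.getKey(), new ArrayList<>(entry.getValue()));
                } else {
                    merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                          .addAll(entry.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Stand-ins for e.g. [ImageParser.METADATA] and [TikaOCR.METADATA]
        Map<String, List<String>> imageLayer = Map.of("dc:creator", List.of("Tim"));
        Map<String, List<String>> ocrLayer = Map.of("dc:creator", List.of("Chris"));
        List<Map<String, List<String>>> layers = List.of(imageLayer, ocrLayer);

        System.out.println(merge(layers, CollisionPolicy.OVERWRITE));
        System.out.println(merge(layers, CollisionPolicy.KEEP_ALL));
    }
}
```

With the dc:creator example from this thread, OVERWRITE leaves only the value from the last layer, while KEEP_ALL retains both values in layer order.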




On 10/26/17, 9:14 AM, "Nick Burch"  wrote:

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> Why don’t we just store N copies of the stream, and parse it twice?

I'm not sure that's the challenge though? Using TikaInputStream we can 
buffer to a temp file if needed to re-read the input

> Of course that’s the ugly way, but currently the way I’ve hacked this in 
> all of my projects is simply to call Tika N times OUTSIDE of Tika. Why 
> don’t we just use that as the weakest baseline and work backwards from 
> there?

I think our main challenge right now is on the output end. How do you deal 
with multiple different Metadata results that might clash after running 
Tika several times? How do you deal with multiple (some potentially empty, 
some overlapping) XHTML outputs from multiple parses? Can we copy those 
approaches?

Thanks
Nick




Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch

On Thu, 26 Oct 2017, Chris Mattmann wrote:
> Why don’t we just store N copies of the stream, and parse it twice?


I'm not sure that's the challenge though? Using TikaInputStream we can 
buffer to a temp file if needed to re-read the input
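
For reference, the re-read idea does not depend on Tika at all. A minimal JDK-only sketch, assuming a hypothetical SpooledInput helper (this is not TikaInputStream, whose own buffering is not shown here), copies the source to a temp file once and hands out a fresh stream for each parse:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: spools a one-shot stream to disk so it can be re-read. */
public class SpooledInput implements AutoCloseable {
    private final Path tempFile;

    SpooledInput(InputStream source) throws IOException {
        tempFile = Files.createTempFile("spooled-", ".tmp");
        // createTempFile already created the file, so it must be replaced
        Files.copy(source, tempFile, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Returns a new stream positioned at the start of the spooled content. */
    InputStream open() throws IOException {
        return Files.newInputStream(tempFile);
    }

    @Override
    public void close() throws IOException {
        Files.deleteIfExists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        InputStream oneShot =
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        try (SpooledInput spooled = new SpooledInput(oneShot)) {
            // Two independent passes over the same bytes, one per "parser"
            for (int pass = 1; pass <= 2; pass++) {
                try (InputStream in = spooled.open()) {
                    String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    System.out.println("pass " + pass + ": " + text);
                }
            }
        }
    }
}
```
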


> Of course that’s the ugly way, but currently the way I’ve hacked this in 
> all of my projects is simply to call Tika N times OUTSIDE of Tika. Why 
> don’t we just use that as the weakest baseline and work backwards from 
> there?


I think our main challenge right now is on the output end. How do you deal 
with multiple different Metadata results that might clash after running 
Tika several times? How do you deal with multiple (some potentially empty, 
some overlapping) XHTML outputs from multiple parses? Can we copy those 
approaches?


Thanks
Nick

Re: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Chris Mattmann
Why don’t we just store N copies of the stream, and parse it twice?

Of course that’s the ugly way, but currently the way I’ve hacked this in all of
my projects is simply to call Tika N times OUTSIDE of Tika. Why don’t we just 
use
that as the weakest baseline and work backwards from there?

Chris




On 10/26/17, 3:56 AM, "Nick Burch"  wrote:

Hi All

Based on the plan on the wiki, we still have a 
major breaking change or two planned for Tika 2 that we haven't yet 
"broken". (There's also removing some deprecated stuff etc)


As I understand it, the biggest breaking TODO change is around having 
multiple parsers available + active for a given format. This could be to 
support fallback parsers, eg "try this fancy new parser, but if it fails, 
retry with this simpler one" or "try this xml parser, if that fails just 
try strings". A related but different case is to cleanly support multiple 
parsers covering different aspects, eg OCR an image plus extract metadata, 
or NER on the contents of a scientific PDF + text + metadata + NER of the 
OCR of embedded images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) 
handled via one parser (eg OCR or NER) having an embedded hard-coded 
reference to another (eg Image or PDF).


We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers 
interact with the SAX content handler. For the fallback case, that's how 
to say "sorry, ignore all that XML we already sent, we're starting again 
with this XML now". For the multiple parser case, it's how we could have 
the image parser "finish" the (empty) XHTML but then have the OCR one send 
some text, or have the NER parser get at the XHTML text of the PDF + OCR 
of embedded images to enhance with the entities.


What do we think for this? Can we come up with a solution to let this go 
forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and 
do this stuff in Tika 3 instead?

Nick





RE: Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Allison, Timothy B.
At this point, I'm willing to punt to 3.x, unless there's momentum for either 
of these two.  They would be great to have!

1) chaining multiple parsers -- additive
This shouldn't be too bad, except where there's conflicting metadata -- parser1 
says author is 'bob', parser2 says author is 'alice'.  We would break some 
uniqueness guarantees for some Properties that should only allow a single value 
if we added those values...  Overwriting feels like a bad idea.  Perhaps we 
remove the uniqueness guarantees when in "additive" mode ... or let users 
select additive/overwrite?

2) fallback parsers 
>The biggest stumbling block, as I see it, is how to let multiple parsers 
>interact with the SAX content handler. For the fallback case, that's how to 
>say "sorry, ignore all that XML we already sent, we're starting again with 
>this XML now".

Y, this has been what's holding me back.  How do we create a resettable handler 
that doesn't have us mucking too much with all of our current handlers?  For 
those with outputstreams/writers, I imagine we'd require a resettable 
OutputStream...TikaOutputStream(?)

TikaOutputStream() -- underlying StringWriter; on reset(), it would just be 
replaced with a new StringWriter ??? Not quite right...
TikaOutputStream.get(Path/File) -- would hold the underlying file/path, close 
the writer, and just rewrite on reset()
TikaOutputStream.get(ByteArrayOutputStream)  baos has a reset() so that should 
work...
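
A rough JDK-only sketch of the ByteArrayOutputStream and Path variants above. ResettableOutput, InMemory and OnDisk are hypothetical names used for illustration only; no such class exists in Tika, and the StringWriter case is left out since it is flagged above as not quite right.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical resettable sink; a sketch, not part of Tika. */
public abstract class ResettableOutput extends OutputStream {

    /** Discards everything written so far. */
    public abstract void reset() throws IOException;

    /** Wraps a ByteArrayOutputStream, which already has reset(). */
    static final class InMemory extends ResettableOutput {
        private final ByteArrayOutputStream buffer;

        InMemory(ByteArrayOutputStream buffer) { this.buffer = buffer; }

        @Override public void write(int b) { buffer.write(b); }
        @Override public void reset() { buffer.reset(); }
    }

    /** Holds the path; reset() closes the stream and reopens it truncated. */
    static final class OnDisk extends ResettableOutput {
        private final Path path;
        private OutputStream out;

        OnDisk(Path path) throws IOException {
            this.path = path;
            this.out = open();
        }

        private OutputStream open() throws IOException {
            return Files.newOutputStream(path, StandardOpenOption.CREATE,
                    StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE);
        }

        @Override public void write(int b) throws IOException { out.write(b); }
        @Override public void flush() throws IOException { out.flush(); }
        @Override public void close() throws IOException { out.close(); }

        @Override public void reset() throws IOException {
            out.close();
            out = open();
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ResettableOutput out = new InMemory(baos);
        out.write("output of the parser that failed".getBytes(StandardCharsets.UTF_8));
        out.reset(); // throw away the first attempt
        out.write("output of the fallback parser".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(baos.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Neither variant helps when the bytes have already left the process (a socket or servlet response, say), so this only covers sinks that the caller still owns.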

What other use cases?




-----Original Message-----
From: Nick Burch [mailto:n...@apache.org] 
Sent: Thursday, October 26, 2017 6:57 AM
To: dev@tika.apache.org
Subject: Not-yet-broken breaking changes for Tika 2?

Hi All

Based on the plan on the wiki, we still have a major 
breaking change or two planned for Tika 2 that we haven't yet "broken". 
(There's also removing some deprecated stuff etc)


As I understand it, the biggest breaking TODO change is around having multiple 
parsers available + active for a given format. This could be to support 
fallback parsers, eg "try this fancy new parser, but if it fails, retry with 
this simpler one" or "try this xml parser, if that fails just try strings". A 
related but different case is to cleanly support multiple parsers covering 
different aspects, eg OCR an image plus extract metadata, or NER on the 
contents of a scientific PDF + text + metadata + NER of the OCR of embedded 
images in the PDF.

Currently, we can't cleanly do the former, and the latter is (badly) handled 
via one parser (eg OCR or NER) having an embedded hard-coded reference to 
another (eg Image or PDF).


We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers 
interact with the SAX content handler. For the fallback case, that's how to say 
"sorry, ignore all that XML we already sent, we're starting again with this XML 
now". For the multiple parser case, it's how we could have the image parser 
"finish" the (empty) XHTML but then have the OCR one send some text, or have 
the NER parser get at the XHTML text of the PDF + OCR of embedded images to 
enhance with the entities.


What do we think for this? Can we come up with a solution to let this go 
forward? Is there a pattern from elsewhere we can follow?

Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and do 
this stuff in Tika 3 instead?

Nick


Not-yet-broken breaking changes for Tika 2?

2017-10-26 Thread Nick Burch

Hi All

Based on the plan on the wiki, we still have a 
major breaking change or two planned for Tika 2 that we haven't yet 
"broken". (There's also removing some deprecated stuff etc)



As I understand it, the biggest breaking TODO change is around having 
multiple parsers available + active for a given format. This could be to 
support fallback parsers, eg "try this fancy new parser, but if it fails, 
retry with this simpler one" or "try this xml parser, if that fails just 
try strings". A related but different case is to cleanly support multiple 
parsers covering different aspects, eg OCR an image plus extract metadata, 
or NER on the contents of a scientific PDF + text + metadata + NER of the 
OCR of embedded images in the PDF.


Currently, we can't cleanly do the former, and the latter is (badly) 
handled via one parser (eg OCR or NER) having an embedded hard-coded 
reference to another (eg Image or PDF).



We've got some details on the proposed plans and ideas on the wiki:
https://wiki.apache.org/tika/CompositeParserDiscussion

The biggest stumbling block, as I see it, is how to let multiple parsers 
interact with the SAX content handler. For the fallback case, that's how 
to say "sorry, ignore all that XML we already sent, we're starting again 
with this XML now". For the multiple parser case, it's how we could have 
the image parser "finish" the (empty) XHTML but then have the OCR one send 
some text, or have the NER parser get at the XHTML text of the PDF + OCR 
of embedded images to enhance with the entities.
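
One possible shape for the fallback case, sketched with only the JDK's SAX classes: record each parser's events in a buffer, clear the buffer if that parser fails, and replay to the real handler only once a parser has finished. BufferingHandler is a hypothetical name, not an existing Tika class, and the sketch records only startElement, endElement and characters; document, prefix-mapping and whitespace events are dropped, and everything is held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records SAX events so they can be discarded or replayed. */
public class BufferingHandler extends DefaultHandler {

    /** One recorded event, replayable against any ContentHandler. */
    private interface Event {
        void replay(ContentHandler target) throws SAXException;
    }

    private final List<Event> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Attributes copy = new AttributesImpl(atts); // callers may reuse atts
        events.add(h -> h.startElement(uri, localName, qName, copy));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add(h -> h.endElement(uri, localName, qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        char[] copy = Arrays.copyOfRange(ch, start, start + length); // callers may reuse ch
        events.add(h -> h.characters(copy, 0, copy.length));
    }

    /** "Ignore all that XML we already sent": forget everything recorded so far. */
    public void reset() {
        events.clear();
    }

    /** Sends the recorded events, in order, to the real handler. */
    public void replay(ContentHandler target) throws SAXException {
        for (Event event : events) {
            event.replay(target);
        }
    }

    public static void main(String[] args) throws SAXException {
        BufferingHandler buffer = new BufferingHandler();

        // First parser emits some content, then (we pretend) fails
        char[] partial = "partial output".toCharArray();
        buffer.characters(partial, 0, partial.length);
        buffer.reset();

        // Fallback parser starts again from scratch
        char[] text = "fallback text".toCharArray();
        buffer.startElement("", "p", "p", new AttributesImpl());
        buffer.characters(text, 0, text.length);
        buffer.endElement("", "p", "p");

        StringBuilder seen = new StringBuilder();
        buffer.replay(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                seen.append(ch, start, length);
            }
        });
        System.out.println(seen);
    }
}
```

The obvious cost is that nothing streams to the caller until a parser succeeds, so large documents would need a bounded or disk-backed buffer.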



What do we think for this? Can we come up with a solution to let this go 
forward? Is there a pattern from elsewhere we can follow?


Or do we need to cancel this for 2.x, ponder it for another 1-2 years, and 
do this stuff in Tika 3 instead?


Nick


Re: Tika 2 parsers

2017-10-26 Thread Gethin James
The use case is really when embedding Tika and transitive dependencies.  I
prefer the Tika 2 modular approach as it pulls in fewer jars; however, I
don't have so much control over my existing version of PDFBox.  I will
explore using Tika Server!

On 25 October 2017 at 17:44, Allison, Timothy B.  wrote:

> Sorry, Tika 2.0 will require PDFBox 2.x at least.  There were some
> breaking changes btwn PDFBox 1.x and 2.x, and our PDFParser relies on 2.x
> now.
>
> Is there something in PDFBox 1.8.x that you need that doesn't exist in 2.x?
>
> -----Original Message-----
> From: Gethin James [mailto:gja...@nuxeo.com]
> Sent: Wednesday, October 25, 2017 8:20 AM
> To: dev@tika.apache.org
> Subject: Re: Tika 2 parsers
>
> Thanks for the help, I gave the parsers a go.  Just a question on the
> PDFBox dependency you mentioned.  Will Tika 2.0 require a minimum PDFBox
> version? I am embedding Tika and have pdfbox 1.8.9, so I am wondering if
> that would work?
>
> On 25 October 2017 at 10:49, Sergey Beryozkin 
> wrote:
>
> > As Tim indicated the 2.x line is not actively developed at the moment,
> > but what is already there now is sufficient for the initial try (ex.
> > with PDF/ODT parsers)
> >
> > Sergey
> >
> >
> >
> > On 25/10/17 08:30, Gethin James wrote:
> >
> >> I did have a look for the source, what branch is it?
> >> https://github.com/apache/tika/tree/2.x doesn't seem to have been
> >> updated since May.
> >>
> >> On 24 October 2017 at 22:15, Sergey Beryozkin 
> >> wrote:
> >>
> >> I did try the modules in the earlier version of the CXF demo,
> >>>
> >>> see the right panel,
> >>>
> >>> https://github.com/apache/cxf/commit/c2ccecb23ba23497c95be89f9b37f38c69faba7a#diff-b5ed531ebf92978dcbcf1ac6cc6331c0
> >>>
> >>> They should be available in the snapshot repo
> >>>
> >>> Cheers, Sergey
> >>>
> >>> On 24/10/17 19:45, Allison, Timothy B. wrote:
> >>>
> >>> We'll switch master over to the 2.0 layout after our next release,
> >>> which
>  should happen shortly after the release of PDFBox 2.0.8...roughly
>  in the next week for PDFBox, next month for Tika.
> 
>  We have abandoned keeping the current 2.x up to date, and I was
>  hoping there would at least be a build here:
>  https://builds.apache.org/view/T/view/Tika/job/tika-2.x/, but there
> isn't a clean build there.
> 
>  So, unfortunately, for now, your best bet is to build it yourself
>  from source.  Sorry.
> 
> 
> 
>  -----Original Message-----
>  From: Gethin James [mailto:gja...@nuxeo.com]
>  Sent: Tuesday, October 24, 2017 12:19 PM
>  To: dev@tika.apache.org
>  Subject: Tika 2 parsers
> 
>  Hi, I am interested in trying the more modular approach of using
>  the Tika
>  2 parsers.  Are the Tika 2 artifacts available in a maven repo
>  somewhere?
>  Is there any documentation on how to use them or how they differ from
>  Tika 1?
> 
>  Thanks,
>  Gethin.
> 
> 
> 
> >>
>