Bob,
Thank you, again. This looks promising!
To continue down the strawman path and to start discussion on the elephant in
the room...
We'd want bundles that allow enough control for users but aren't too much of a
hassle to configure. There will be trade-offs.
So, what do we think of this strawman for proposed bundles:
tika-classic-parser-bundle/
    tika-office-parser-bundle/ (including microsoft, opendocument, pst,
        rtf, iwork? Has dependency on html/text)
    tika-pdf-parser-bundle/
    tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml,
        kml, feed, iptc, crypto, etc.?)
    tika-sourcecode-parser-bundle (parsers that handle source code)
    tika-package-parser-bundle (all zip/tar/etc.)
tika-multimedia-parser-bundle/ (parsers that pull metadata out of image,
        audio, audio+video files)
    tika-image-parser-bundle
    tika-image-ocr-parser-bundle
    tika-audio-parser-bundle
    tika-video-parser-bundle
tika-scientific-parser-bundle/ (all parsers that handle scientific data sets:
        grib, isatab, gdal, hdf, netcdf, geoinfo, dif...much hand-waving...input,
        Chris?)
tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers
        that rely on native libs...unfortunately, this doesn't fit well thematically...)
tika-advanced-bundle/ (all parsers that rely on nlp or other advanced
        techniques for extraction of information...
        these aren't really just pulling text and metadata out, but are
        operating on the text/metadata once it has been pulled out.
        We may need separate bundles for each?)
    tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc.
        ...or maybe we want separate bundles for each?)
    tika-sentiment-parser-bundle (imaginary...?)
    tika-object-parser-bundle
Where to put these?
    font parser
    executable
    mat
    prt
    strings
Cheers,
Tim
-----Original Message-----
From: Bob Paulin [mailto:[email protected]]
Sent: Tuesday, August 04, 2015 8:56 AM
To: [email protected]
Subject: Re: [DISCUSS] A more modular parser project
So I just tried adding a META-INF/services/org.apache.tika.parser.Parser
file to each bundle in the straw man implementation and it seemed to do
the trick. Looks like the ServiceLoader code searches the classloader
for all of these files and iterates through them to pick up each jar's
META-INF/services/org.apache.tika.parser.Parser entries and adds them to
the list. I've updated the code on github to include one per bundle.
This might be the way to go.
ex.
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
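
A minimal sketch of the lookup described above, for anyone who wants to try it without building the bundles. It does not use Tika itself: it only shows that a single classloader can return one META-INF/services/org.apache.tika.parser.Parser file per classpath entry. The class name ServiceFileScan, the temp directories standing in for bundle jars, and the org.example.Hypothetical* parser names are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;

public class ServiceFileScan {

    // Name of the provider-configuration file discussed in this thread.
    public static final String SERVICE = "org.apache.tika.parser.Parser";

    // Lists every META-INF/services/<serviceName> file visible to the loader.
    // getResources returns one URL per classpath entry that contains the file.
    public static List<URL> providerFiles(ClassLoader loader, String serviceName) {
        try {
            return Collections.list(loader.getResources("META-INF/services/" + serviceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in for a bundle jar: a temp directory holding a
    // single services file that names one (made-up) parser class.
    public static URL fakeBundle(String bundleName, String implClassName) {
        try {
            Path root = Files.createTempDirectory(bundleName);
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            Files.write(services.resolve(SERVICE), implClassName.getBytes(StandardCharsets.UTF_8));
            return root.toUri().toURL();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URL image = fakeBundle("tika-image-parser-bundle", "org.example.HypotheticalImageParser");
        URL pdf = fakeBundle("tika-pdf-parser-bundle", "org.example.HypotheticalPdfParser");

        // Null parent, so only the two fake bundles are searched.
        ClassLoader loader = new URLClassLoader(new URL[] {image, pdf}, null);

        // Prints one services-file URL per bundle.
        for (URL file : providerFiles(loader, SERVICE)) {
            System.out.println(file);
        }
    }
}
```

Because each bundle contributes its own file, the per-bundle entries can be merged at load time without a single shared services file in tika-parsers.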
- Bob
On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:
>>> +1 to moving the source to bundles. I think for 2.0 it would be easier
> to consolidate into a parser uber jar than trying to tease things out
> like I did in the straw man impl. However deciding how to break things
> up might take some experimentation.
>
> Y, and the strawman is a great easy entry down this path towards 2.0. I
> think the main hangup will be coming to consensus about granularity and
> nature of the packages, but we can burn that bridge when we get to it. There
> are some dependencies between parsers, but we can work through that.
>
>>> 1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
> consider moving this up to core).
> Y, I put that in tika-parsers because it relies on commons codec, and I
> wanted to keep that dependency out of tika-core. But, I'm willing to add it
> to tika-core if there aren't objections.
>
>
>>> 2) Since the META-INF/services/org.apache.tika.parser.Parser is in
> tika-parser we'd need to rethink the static ServiceLoader strategy to
> either always be dynamic or figure out a way to have each jar bring
> its own static loader.
>
> Hmmm...is there a way to specify this in one overall tika-config file or in
> separate configs in each bundle (yuck)...
>