Bob,
  Thank you, again.  This looks promising!

To continue down the strawman path and to start discussion on the elephant in 
the room...

We'd want bundles that allow enough control for users but aren't too much of a 
hassle to configure.  There will be trade-offs.

So, what do we think of this strawman for proposed bundles:

tika-classic-parser-bundle/
        tika-office-parser-bundle/ (including microsoft, opendocument, pst, rtf, iwork? Has dependency on html/text)
        tika-pdf-parser-bundle/
                tika-text-parser-bundle/ (including txt, chm, rfc822, html, xml, kml, feed, iptc, crypto, etc?)
        tika-sourcecode-parser-bundle (parsers that handle source code)
        tika-package-parser-bundle (all zip/tar/etc)

tika-multimedia-parser-bundle/ (parsers that pull metadata out of image, audio, audio+video files)
        tika-image-parser-bundle
        tika-image-ocr-parser-bundle
        tika-audio-parser-bundle
        tika-video-parser-bundle

tika-scientific-parser-bundle/ (all parsers that handle scientific data sets (grib, isatab, gdal, hdf, netcdf, geoinfo, dif...much hand-waving...input, Chris?))

tika-nativelib-parser-bundle/ (sqlite...any others at the moment? all parsers that rely on native libs...unfortunately, this doesn't fit well thematically...)

tika-advanced-bundle/ (all parsers that rely on nlp or other advanced techniques for extraction of information...these aren't really just pulling text and metadata out, but are operating on the text/metadata once it has been pulled out.  We may need separate bundles for each?)
        tika-nlp-parser-bundle/ (ctakes, phone number, geo.topic, grobid(?) etc....or maybe we want separate bundles for each?)
        tika-sentiment-parser-bundle (imaginary...?)
        tika-object-parser-bundle
        
Where to put these?
        font parser
        executable
        mat
        prt
        strings


Cheers,
 
               Tim



-----Original Message-----
From: Bob Paulin [mailto:[email protected]] 
Sent: Tuesday, August 04, 2015 8:56 AM
To: [email protected]
Subject: Re: [DISCUSS] A more modular parser project

So I just tried adding a META-INF/services/org.apache.tika.parser.Parser 
file to each bundle in the straw man implementation and it seemed to do 
the trick. Looks like the ServiceLoader code searches the classloader 
for all of these files and iterates through them to pick up each jar's 
META-INF/services/org.apache.tika.parser.Parser entries and adds them to 
the list.  I've updated the code on github to include one per bundle.  
This might be the way to go.

ex.
https://github.com/bobpaulin/tika/tree/trunk/tika-parser-bundles/tika-image-parser-bundle/src/main/resources/META-INF/services
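
For anyone who wants to see the lookup in isolation, here is a minimal sketch of the behaviour described above. It is not the Tika code: the Parser interface and the two stub classes are hypothetical stand-ins, and two temp directories play the role of two bundle jars, each carrying its own META-INF/services file. The only real API it leans on is java.util.ServiceLoader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

class ServiceFileSketch {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    public interface Parser {
        String name();
    }

    /** Hypothetical provider that one bundle would contribute. */
    public static class ImageParserStub implements Parser {
        public ImageParserStub() {}
        public String name() { return "image"; }
    }

    /** Hypothetical provider that another bundle would contribute. */
    public static class PdfParserStub implements Parser {
        public PdfParserStub() {}
        public String name() { return "pdf"; }
    }

    /** Builds a temp directory that mimics a jar with one services file. */
    static URL fakeBundle(Class<? extends Parser> provider) throws IOException {
        Path root = Files.createTempDirectory("bundle");
        Path services = Files.createDirectories(
                root.resolve("META-INF").resolve("services"));
        // The file is named after the service interface and lists one
        // provider class name per line.
        Files.write(services.resolve(Parser.class.getName()),
                (provider.getName() + "\n").getBytes(StandardCharsets.UTF_8));
        return root.toUri().toURL();
    }

    /** Loads every provider listed in any services file on the loader. */
    static List<String> discover() {
        try {
            URL[] bundles = {
                fakeBundle(ImageParserStub.class),
                fakeBundle(PdfParserStub.class)
            };
            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    bundles, ServiceFileSketch.class.getClassLoader())) {
                // ServiceLoader reads each matching services file it finds
                // and instantiates the providers named in them.
                for (Parser p : ServiceLoader.load(Parser.class, loader)) {
                    names.add(p.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```

Both stubs come back from a single ServiceLoader.load call even though they are registered in separate files, which is the property that would let each bundle ship its own entry.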


- Bob

On 8/3/2015 9:21 PM, Allison, Timothy B. wrote:
>>> +1 to moving the source to bundles.  I think for a 2.0 it would be easier
> to consolidate into a parser uber jar than trying to tease things out
> like I did in the straw man impl. However deciding how to break things
> up might take some experimentation.
>
> Y, and the strawman is a great easy entry down this path towards 2.0.  I 
> think the main hangup will be coming to consensus about granularity and 
> nature of the packages, but we can burn that bridge when we get to it.  There 
> are some dependencies between parsers, but we can work through that.
>
>>> 1) To spin up the GUI you need org.apache.tika.parser.util (perhaps
> consider moving this up to core).
> Y, I put that in tika-parsers because it relies on commons codec, and I 
> wanted to keep that dependency out of tika-core.  But, I'm willing to add it 
> to tika-core if there aren't objections.
>
>
>>> 2) Since the META-INF/services/org.apache.tika.parser.Parser is in
> tika-parser we'd need to rethink the static ServiceLoader strategy to
> either always be dynamic or figure out a way to have each jar bring
> their own static loader.
>
> Hmmm...is there a way to specify this in one overall tika-config file or in 
> separate configs in each bundle (yuck)...
>
