Hi Jukka,
On Sun, Sep 12, 2010 at 5:46 PM, Ken Krugler
<[email protected]> wrote:
But that also seems clunky. Any other suggestions?
A simpler approach would be to pass a list of already instantiated
Parser objects to AutoDetectParser, like this:
    public AutoDetectParser(Detector detector, Parser... parsers) {
        setDetector(detector);
        Map<MediaType, Parser> map = new HashMap<MediaType, Parser>();
        ParseContext context = new ParseContext();
        for (Parser parser : parsers) {
            for (MediaType type : parser.getSupportedTypes(context)) {
                map.put(type, parser);
            }
        }
        setParsers(map);
    }
Thanks for the suggestion. This would work for the current 0.8 code
base, so I might just go ahead and add that.
But I found a few other places that called
TikaConfig.getDefaultConfig() besides AutoDetectParser():
- Tika()
- MediaTypeRegistry.getDefaultRegistry()
These don't seem to be used outside of test code, but I could easily
see people adding calls to them (and getDefaultConfig).
Relying on nothing else in the Tika sub-system ever calling
getDefaultConfig() seems fragile, so a more resilient solution would
be good, especially since this is the second time this problem has
bitten me during a big parse job (20M+ documents).
-- Ken
BTW, the need to pass a MediaType->Parser map to
CompositeParser.setParsers() is a remnant of the time when we didn't
have the Parser.getSupportedTypes() method. Nowadays it would probably
be better to simply pass a collection of parsers and use
getSupportedTypes() calls for dispatch during CompositeParser.parse().
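A minimal sketch of that dispatch idea follows. It deliberately does not
use Tika's real classes: SimpleParser, LabelParser and DispatchingParser
are hypothetical stand-ins, media types are plain strings, and parse()
only returns a label so the chosen parser is visible. It is an
illustration under those assumptions, not how CompositeParser is or
would be implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins, NOT Tika's API: types are strings, parse() returns a label.
class DispatchSketch {

    interface SimpleParser {
        Set<String> getSupportedTypes();
        String parse(String content);
    }

    // Parser with a fixed set of supported types, used only for the demonstration.
    static class LabelParser implements SimpleParser {
        private final String label;
        private final Set<String> types;

        LabelParser(String label, String... types) {
            this.label = label;
            this.types = Collections.unmodifiableSet(
                    new HashSet<String>(Arrays.asList(types)));
        }

        public Set<String> getSupportedTypes() {
            return types;
        }

        public String parse(String content) {
            return label + ":" + content;
        }
    }

    // Holds a plain collection of parsers; no type-to-parser map is built up front.
    static class DispatchingParser {
        private final List<SimpleParser> parsers;

        DispatchingParser(Collection<? extends SimpleParser> parsers) {
            this.parsers = new ArrayList<SimpleParser>(parsers);
        }

        // Ask each parser at parse time whether it supports the type.
        // Later parsers win, matching the map.put() overwrite behaviour
        // of the constructor quoted earlier in this message.
        SimpleParser findParser(String type) {
            for (int i = parsers.size() - 1; i >= 0; i--) {
                SimpleParser candidate = parsers.get(i);
                if (candidate.getSupportedTypes().contains(type)) {
                    return candidate;
                }
            }
            return null;
        }

        String parse(String type, String content) {
            SimpleParser parser = findParser(type);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + type);
            }
            return parser.parse(content);
        }
    }

    public static void main(String[] args) {
        DispatchingParser dispatcher = new DispatchingParser(Arrays.asList(
                new LabelParser("html", "text/html"),
                new LabelParser("pdf", "application/pdf")));
        System.out.println(dispatcher.parse("application/pdf", "doc"));
    }
}
```

The linear scan keeps the sketch short; a real implementation handling
many types would probably cache the lookup per media type.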
As an aside, what's the standard use case for specifying an explicit
classloader? I haven't seen this used in other projects, so I'm
curious.
See TIKA-419 [1] for the relevant background.
[1] https://issues.apache.org/jira/browse/TIKA-419
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g