Jukka Zitting
Thu, 10 Dec 2009 08:20:44 -0800
Hi,

On Fri, Dec 4, 2009 at 3:57 PM, Daniel Knapp <daniel.kn...@mni.fh-giessen.de> wrote:
> is there an option to define the content types that should be parsed in an
> archive file?
> for example i have a zip archive that contains jar and pdf files, tika should
> only parse the pdf files and skip the rest.

If you use the Parser interface directly you can pass in a custom
CompositeParser instance in the ParseContext to explicitly control how
component documents within an archive get parsed. Something like this
should do the trick:
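
    // Restrict component documents to PDF: only this media type is mapped.
    CompositeParser composite = new CompositeParser();
    composite.setParsers(Collections.singletonMap(
            "application/pdf", (Parser) new PDFParser()));

    // The parser set in the context is used for documents inside the archive.
    ParseContext context = new ParseContext();
    context.set(Parser.class, composite);

    // The top-level archive itself is still handled by AutoDetectParser.
    new AutoDetectParser().parse(..., context);
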
> or is there an general option to define which content types should be parsed,
> using the Tika.parse(...) facade.

You can modify the Tika configuration that you pass to the Tika
facade, but the same configuration applies both to the top-level
archive and to any documents inside it, so this may not be exactly
what you're looking for.
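
For comparison, the same "PDF only" selection can be sketched without
Tika at all, using plain java.util.zip. This is only an illustration of
the filtering idea, not how Tika does it internally: the class name
PdfOnlyZipFilter, the helper methods and the sample entry names are all
made up for this example, and the check is a simple "%PDF-" signature
test rather than real content type detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Tika.
public class PdfOnlyZipFilter {

    // Every PDF file starts with these bytes.
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the names of zip entries whose content starts with the PDF signature. */
    public static List<String> pdfEntryNames(InputStream zipStream) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // Read just enough bytes of the entry to compare with the signature.
                byte[] head = new byte[PDF_MAGIC.length];
                int n = 0;
                while (n < head.length) {
                    int r = zip.read(head, n, head.length - n);
                    if (r < 0) {
                        break;
                    }
                    n += r;
                }
                if (n == head.length && Arrays.equals(head, PDF_MAGIC)) {
                    // A real application would hand this entry to a PDF parser here.
                    names.add(entry.getName());
                }
                // Anything else (for example a jar file) is skipped.
            }
        }
        return names;
    }

    /** Builds a small in-memory zip with one non-PDF entry and one PDF-like entry. */
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry("lib.jar"));
            out.write("PK not a pdf".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("doc.pdf"));
            out.write("%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> names = pdfEntryNames(new ByteArrayInputStream(sampleZip()));
        System.out.println(names); // prints [doc.pdf]
    }
}
```

The ParseContext approach is still the better fit if you want Tika's
own type detection and text extraction; the sketch only shows what
"parse the PDF files and skip the rest" amounts to at the archive level.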
BR,
Jukka Zitting