tika-user  

Re: parsing only specified content types in archive

Jukka Zitting
Thu, 10 Dec 2009 08:20:44 -0800

Hi,

On Fri, Dec 4, 2009 at 3:57 PM, Daniel Knapp
<daniel.kn...@mni.fh-giessen.de> wrote:
> is there an option to define the content types that should be parsed in an
> archive file?
> for example i have a zip archive that contains jar and pdf files, tika should
> only parse the pdf files and skip the rest.

If you use the Parser interface directly, you can pass a custom
CompositeParser instance in the ParseContext to control explicitly how
the component documents within an archive get parsed. Registering only
a PDF parser, something like this should do the trick:

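    // Composite parser that only knows how to handle PDF documents
    CompositeParser composite = new CompositeParser();
    composite.setParsers(Collections.singletonMap(
        "application/pdf", (Parser) new PDFParser()));

    // Use it for the component documents found inside the archive
    ParseContext context = new ParseContext();
    context.set(Parser.class, composite);

    // The top-level archive itself is still handled by AutoDetectParser
    new AutoDetectParser().parse(..., context);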
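
To make the dispatch idea behind this concrete, here is a small
self-contained model in plain Java. It does not use Tika at all: the
class TypeDispatchSketch, the guessType helper (which only looks at the
file name) and the function-based "parsers" are hypothetical stand-ins
for Tika's type detection and parser lookup, so treat it as a sketch of
the principle rather than of Tika's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TypeDispatchSketch {

    // Hypothetical stand-in for content type detection:
    // guesses the type from the file name extension only.
    static String guessType(String name) {
        String lower = name.toLowerCase();
        if (lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        if (lower.endsWith(".jar")) {
            return "application/java-archive";
        }
        return "application/octet-stream";
    }

    // Runs the registered parser for each entry whose type has one;
    // entries without a registered parser are skipped.
    static List<String> parseEntries(
            List<String> entryNames,
            Map<String, Function<String, String>> parsers) {
        List<String> results = new ArrayList<>();
        for (String name : entryNames) {
            Function<String, String> parser = parsers.get(guessType(name));
            if (parser == null) {
                continue; // no parser registered for this type
            }
            results.add(parser.apply(name));
        }
        return results;
    }

    public static void main(String[] args) {
        // Only PDF is registered, mirroring the singleton map above.
        Map<String, Function<String, String>> parsers = new HashMap<>();
        parsers.put("application/pdf", name -> "parsed " + name);

        List<String> entries =
            Arrays.asList("lib/tool.jar", "docs/manual.pdf", "README");
        System.out.println(parseEntries(entries, parsers));
        // prints [parsed docs/manual.pdf]
    }
}
```

Because the map holds a single entry, the jar and the untyped file find
no parser and are passed over, which is the effect the singleton map in
the snippet above is meant to achieve for the documents in the archive.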

> or is there a general option to define which content types should be parsed,
> using the Tika.parse(...) facade.

You can modify the Tika configuration that you pass to the Tika
facade, but the same configuration applies both to the top-level
archive and to any documents inside it, so this may not be exactly
what you're looking for.

BR,

Jukka Zitting