I have a use case for Tika Server that I'm sure is pretty common.

I have parse nodes that download files from data sources and need to
extract the content and metadata from those files. The parsing needs to be
resilient to OOMs and needs to time out gracefully.

Up until now, I've been using this project to parse files:
https://github.com/nddipiazza/tika-fork. It manages a pool of JVMs and
pushes the requests through them, so that if a file is a bomb and blows up
a JVM, it does not affect my program.
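
To illustrate the isolation idea only (this is not tika-fork's code, and it
uses a fresh child process per task rather than a pool of JVMs), here is a
minimal Python sketch. The name run_isolated and the 30-second default are
hypothetical choices for the example.

```python
import subprocess
import sys


def run_isolated(code, timeout_s=30.0):
    """Run `code` in a separate interpreter process.

    Returns the child's stdout, or None if the child exited with a
    non-zero status or did not finish within `timeout_s` seconds.
    The parent process survives either way.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout if proc.returncode == 0 else None
```

A crash or a hang in the child shows up in the caller as a None result
instead of taking the caller down with it.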

However, when I use this out in the wild, I get a lot of strange timeouts
that I can't reproduce locally. I guess they are related to system
resources on those systems, but I can't really figure out what the problem
is.

So I'm thinking instead I will try out a different approach.

I would like each parse node to have its own Tika Server running, and
I'll just use the endpoint

http://localhost:9998/unpack/all
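
As a rough sketch of what the client side of this could look like, assuming
the endpoint accepts the raw file bytes via HTTP PUT and returns an archive
in the response body: the helper names and the 60-second default below are
made up for the example, and only the URL comes from my setup.

```python
import http.client
import urllib.request

# Endpoint mentioned above; adjust host and port for your own server.
TIKA_UNPACK_URL = "http://localhost:9998/unpack/all"


def build_unpack_request(data, url=TIKA_UNPACK_URL):
    """Build a PUT request that carries the raw file bytes as the body."""
    return urllib.request.Request(url, data=data, method="PUT")


def unpack_with_timeout(data, timeout_s=60.0, url=TIKA_UNPACK_URL):
    """Send the file and return the response bytes.

    Returns None if the server is unreachable, answers with an HTTP
    error, drops the connection, or a socket operation times out.
    """
    req = build_unpack_request(data, url)
    try:
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.read()
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, timeouts and resets are all OSError subclasses.
        return None
```

The timeout here is enforced per socket operation, so the parse node gives
up on a stalled request, but it does nothing to stop the server itself from
running out of memory or spinning on a bad file.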

But I'm worried this will be plagued by the same problems that prompted me
to go to the tika-fork parser: the server will continually go down due to
OOMs, because random files in the wild turn out to be Tika bombs or cause
CPU spikes due to infinite loops, etc.

How is everyone else managing to do this in the field? Is there a way to
configure a Tika Fork parser on the Tika server so that it does not crash
on zip bombs, Excel bombs, etc.?

-Nicholas DiPiazza
