I am attempting to parse tens of millions of office documents with Tika:
PDFs, DOCs, Excel files, XMLs, etc. A wide assortment of types.

Throughput is very important. I need to be able to parse these files in a
reasonable amount of time, but at the same time, accuracy also matters: I
want fewer than 10% of the documents to fail parsing. (By "fail" I mean
fail due to Tika stability, e.g. a timeout while parsing; I do not mean
fail because of the document itself.)

My question: how should I configure Tika Server in a containerized
environment to maximize throughput?

My environment:

   - I am using Openshift.
   - Each Tika parsing pod has *CPU: 2 cores (request) to 2 cores (limit)*
   and *Memory: 8 GiB (request) to 10 GiB (limit)*.
   - I have 10 tika parsing pod replicas.

On each pod, I run a Java program with 8 parse threads.

Each thread:

   - Starts a single tika server process (in spawn child mode)
      - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000
      -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
      -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures -enableFileUrl
   - The thread then continuously grabs a file from the files-to-fetch
   queue and sends it to its Tika Server, stopping when there are no more
   files to parse (see the sketch after this list).
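
Roughly, each parse thread looks like the sketch below. This is a minimal
illustration, not the real code: the jar path, port assignment, queue type,
and helper names are my assumptions, and readiness checks and error handling
are omitted.

```java
import java.util.concurrent.BlockingQueue;

public class TikaParseWorker implements Runnable {

    private final BlockingQueue<String> filesToFetch; // paths of locally buffered files
    private final int port;                           // each thread owns one tika-server port

    TikaParseWorker(BlockingQueue<String> filesToFetch, int port) {
        this.filesToFetch = filesToFetch;
        this.port = port;
    }

    @Override
    public void run() {
        try {
            // Start a single tika-server in spawn-child mode, using the
            // arguments listed above (the jar path is an assumption).
            Process server = new ProcessBuilder(
                    "java", "-jar", "/opt/tika/tika-server.jar",
                    "-s", "-spawnChild",
                    "-maxChildStartupMillis", "120000",
                    "-pingPulseMillis", "500", "-pingTimeoutMillis", "30000",
                    "-taskPulseMillis", "500", "-taskTimeoutMillis", "120000",
                    "-JXmx512m", "-enableUnsecureFeatures", "-enableFileUrl",
                    "-p", String.valueOf(port))
                    .inheritIO()
                    .start();

            // In practice, wait here until the server answers on its port.

            // Drain the queue; poll() returns null when no files remain.
            String path;
            while ((path = filesToFetch.poll()) != null) {
                sendToTika(path);
            }
            server.destroy();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void sendToTika(String localPath) {
        // PUT to http://localhost:<port>/rmeta/text -- see the request
        // sketch further down.
    }
}
```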

Each of these files is stored locally on the pod in a buffer, so the local
file optimization (the fileUrl header) is used.

The Tika web service each thread uses is:

Endpoint: `/rmeta/text`
Method: `PUT`
Headers:
   - writeLimit = 32000000
   - maxEmbeddedResources = 0
   - fileUrl = file:///path/to/file

Files are no greater than 100 MB, and the writeLimit header caps the
extracted text at 32 MB per file.
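
For reference, the request each thread issues looks roughly like this (a
sketch against Apache HttpClient 4.x; the class and method names are mine,
not from the real code):

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

public class TikaCall {

    static String rmetaText(CloseableHttpClient client, int port, String localPath)
            throws Exception {
        HttpPut put = new HttpPut("http://localhost:" + port + "/rmeta/text");
        put.setHeader("writeLimit", "32000000");    // cap extracted text at 32 MB
        put.setHeader("maxEmbeddedResources", "0"); // skip embedded documents
        // Local file optimization: tika-server reads the file from disk
        // itself, so the file bytes never travel over the HTTP connection.
        put.setHeader("fileUrl", "file://" + localPath);
        try (CloseableHttpResponse resp = client.execute(put)) {
            // /rmeta returns a JSON array of metadata objects, with the
            // text under "X-TIKA:content".
            return EntityUtils.toString(resp.getEntity());
        }
    }
}
```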

Each pod is parsing about 370,000 documents per day. I've been
experimenting with a lot of different settings.
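
To put that number in per-thread terms: 370,000 documents over 86,400
seconds is about 4.3 documents/second per pod, or roughly 0.54
documents/second (about 1.9 seconds per document) for each of the 8 parse
threads, and about 3.7 million documents per day across all 10 replicas.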

I previously tried using Tika's actual ForkParser, but its performance was
far worse than spawning Tika Servers. That is why I am using Tika Server.

I don't hate these performance results... but I figured I'd better reach
out and make sure there isn't someone out there who would sanity check my
numbers and say "whoa, that's awful performance, you should be getting xyz
like me!"

Is anyone doing something similar? If so, what settings did you end up
settling on?

Also, I'm wondering whether Apache HttpClient could be adding overhead when
I call my Tika Server /rmeta/text endpoint. I am using a shared connection
pool. Would there be any benefit to, say, using a unique
HttpClients.createDefault() for each thread instead of sharing a connection
pool between the threads?
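
For context, the shared client is wired up roughly like this (a sketch;
the pool sizing is an assumption on my part):

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SharedClient {

    static CloseableHttpClient build(int parseThreads) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(parseThreads);           // one connection per parse thread
        cm.setDefaultMaxPerRoute(parseThreads); // generous: each thread hits its own port
        return HttpClients.custom().setConnectionManager(cm).build();
    }
}
```

Note that HttpClient keys its per-route limits on scheme, host, and port,
so each thread's tika-server counts as its own route here.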


Cross-posted on Stack Overflow as well:
https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
