Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

Nicholas DiPiazza Thu, 26 Nov 2020 09:28:52 -0800

Hey Luis,

It is related: after your fixes I might be able to gain a significant
performance advantage by switching to ForkParser.
I would really benefit from an example from someone who has set up a
multi-threaded ForkParser processing program that can gracefully handle
the huge onslaught that is my use case.
But at this point I doubt I'll switch from Tika Server anyway, because I
invested some time creating a wrapper around it and it is performing very
well.


On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif <lfcnas...@gmail.com>
wrote:

> Not what you asked but related :)
>
> Luis
>
> On Wed, Nov 25, 2020 at 23:20, Luís Filipe Nassif <lfcnas...@gmail.com>
> wrote:
>
> > I've made a few improvements to ForkParser performance in an internal
> > fork. I will try to contribute them upstream...
> >
> > On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <
> > nicholas.dipia...@gmail.com> wrote:
> >
> >> I am attempting to parse tens of millions of office documents with
> >> Tika: PDFs, docs, Excel files, XMLs, etc. A wide assortment of types.
> >>
> >> Throughput is very important. I need to be able to parse these files
> >> in a reasonable amount of time, but at the same time, accuracy is also
> >> pretty important. I hope to have fewer than 10% of the documents fail
> >> to parse. (By fail I mean fail due to Tika stability, like a timeout
> >> while parsing. I do not mean fail due to the document itself.)
> >>
> >> My question: how do I configure Tika Server in a containerized
> >> environment to maximize throughput?
> >>
> >> My environment:
> >>
> >>    - I am using Openshift.
> >>    - Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8
> >>    GiB to 10 GiB*.
> >>    - I have 10 tika parsing pod replicas.
> >>
> >> On each pod, I run a Java program with 8 parse threads.
> >>
> >> Each thread:
> >>
> >>    - Starts a single Tika server process (in spawn child mode)
> >>       - Tika server arguments: -s -spawnChild
> >>       -maxChildStartupMillis 120000 -pingPulseMillis 500
> >>       -pingTimeoutMillis 30000 -taskPulseMillis 500
> >>       -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
> >>       -enableFileUrl
> >>    - Then continuously grabs a file from the files-to-fetch queue and
> >>    sends it to the Tika server, stopping when there are no more files
> >>    to parse.
> >>
> >> Each of these files is stored locally on the pod in a buffer, so the
> >> local file optimization is used.
> >>
> >> The Tika web service each thread calls is:
> >>
> >> Endpoint: `/rmeta/text`
> >> Method: `PUT`
> >> Headers:
> >>    - writeLimit = 32000000
> >>    - maxEmbeddedResources = 0
> >>    - fileUrl = file:///path/to/file
> >> Files are no larger than 100 MB, and the maximum number of bytes of
> >> Tika text (writeLimit) is 32 MB.
> >>
> >> Each pod is parsing about 370,000 documents per day. I've been
> >> experimenting with a ton of different settings.
> >>
> >> I previously tried to use the actual Tika "ForkParser", but the
> >> performance was far worse than spawning Tika servers. That is why I am
> >> using Tika Server.
> >>
> >> I don't hate the performance results of this, but I feel like I'd
> >> better reach out and make sure there isn't someone out there who
> >> would sanity-check my numbers and say "whoa, that's awful performance,
> >> you should be getting xyz like me!"
> >>
> >> Is anyone doing something similar? If so, what settings did you end
> >> up settling on?
> >>
> >> Also, I'm wondering whether Apache HttpClient could be causing any
> >> overhead here when I call my Tika Server /rmeta/text endpoint. I am
> >> using a shared connection pool. Would there be any benefit in, say,
> >> using a unique HttpClients.createDefault() for each thread instead of
> >> sharing a connection pool between the threads?
> >>
> >>
> >> Cross-posted question here as well:
> >>
> >> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
> >>
> >
>
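
For reference, the request described in the original message (PUT to
`/rmeta/text` with the writeLimit, maxEmbeddedResources and fileUrl
headers) could be built roughly as follows. This is a minimal sketch, not
the poster's actual code: it uses the JDK's java.net.http API (Java 11+)
in place of Apache HttpClient, and the class name TikaRmetaRequest, the
build helper, the base URL http://localhost:9998 (Tika Server's default
port) and the empty request body are all assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.file.Path;

public class TikaRmetaRequest {

    // Assumption: a Tika Server child listening on its default port on the same pod.
    static final String TIKA_BASE = "http://localhost:9998";

    /** Builds the PUT /rmeta/text request for a file already on local disk. */
    public static HttpRequest build(Path localFile) {
        return HttpRequest.newBuilder()
                .uri(URI.create(TIKA_BASE + "/rmeta/text"))
                // Header names and values are taken from the original message.
                .header("writeLimit", "32000000")
                .header("maxEmbeddedResources", "0")
                // The server reads the file itself (requires -enableFileUrl),
                // so this sketch assumes no request body is needed.
                .header("fileUrl", localFile.toUri().toString())
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Only builds and prints the request; nothing is sent over the network.
        HttpRequest request = build(Path.of("/path/to/file"));
        System.out.println(request.method() + " " + request.uri());
        request.headers().map()
                .forEach((name, values) -> System.out.println(name + ": " + values));
    }
}
```

Sending it would be a matter of passing the request to
HttpClient.send(request, HttpResponse.BodyHandlers.ofString()); whether
one client is shared or each thread gets its own is the same question
raised above and would need to be measured.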
