Yes, tika-server is the way to go in the long run, as discussed in a recent thread on the users list. I hope I will have time in the future to migrate to it and get rid of the jar-hell problems for good...
On Thu, Nov 26, 2020 at 2:32 PM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I created a Tika fork example I want to add to the documentation as well:
> https://github.com/nddipiazza/tika-fork-parser-example
>
> When we submit your fixes, we should update this example with multi-threading.
>
> On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>
>> Hey Luis,
>>
>> It is related, because after your fixes I might be able to gain a significant performance advantage by switching to the fork parser. I would make great use of an example from someone who has set up a multi-threaded ForkParser processing program that can gracefully handle the huge onslaught that is my use case. But at this point I doubt I'll switch from Tika Server anyway, because I invested some time creating a wrapper around it and it is performing very well.
>>
>> On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif <lfcnas...@gmail.com> wrote:
>>
>>> Not what you asked, but related :)
>>>
>>> Luis
>>>
>>> On Wed, Nov 25, 2020 at 11:20 PM, Luís Filipe Nassif <lfcnas...@gmail.com> wrote:
>>>
>>>> I've made a few improvements to ForkParser performance in an internal fork. I will try to contribute them upstream...
>>>>
>>>> On Mon, Nov 23, 2020 at 12:05 PM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>>>>
>>>>> I am attempting to parse dozens of millions of office documents with Tika: PDFs, docs, Excel files, XMLs, etc. A wide assortment of types.
>>>>>
>>>>> Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time accuracy is also pretty important. I hope to have less than 10% of the parsed documents fail. (And by fail I mean fail due to Tika stability, like a timeout while parsing; I do not mean fail due to the document itself.)
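A multi-threaded ForkParser setup of the kind asked about above might look roughly like this. This is a sketch, not anyone's actual code from the thread: the pool size of 8 mirrors the 8 parse threads mentioned later, the 32,000,000-character cap mirrors the writeLimit, and the file list comes from the command line for illustration. It assumes tika-core (and tika-parsers) 1.x on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserDemo {

    public static void main(String[] args) throws Exception {
        // One ForkParser shared by all threads; it maintains a pool of
        // forked JVMs, so a crash or hang in a parser cannot take down
        // this process.
        ForkParser parser = new ForkParser(ForkParserDemo.class.getClassLoader());
        parser.setPoolSize(8); // number of forked parser JVMs

        // Files-to-fetch queue, filled up front for this demo.
        BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
        for (String f : args) {
            queue.add(Path.of(f));
        }

        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                Path file;
                // Drain the queue; poll() returns null when it is empty.
                while ((file = queue.poll()) != null) {
                    try (InputStream in = Files.newInputStream(file)) {
                        // Cap extracted text, analogous to a writeLimit.
                        BodyContentHandler handler = new BodyContentHandler(32_000_000);
                        parser.parse(in, handler, new Metadata(), new ParseContext());
                    } catch (Exception e) {
                        // A crash/timeout in the forked JVM surfaces here;
                        // log it and move on to the next file.
                        System.err.println(file + ": " + e);
                    }
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        parser.close();
    }
}
```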
>>>>> My question: how do I configure Tika Server in a containerized environment to maximize throughput?
>>>>>
>>>>> My environment:
>>>>>
>>>>> - I am using OpenShift.
>>>>> - Each Tika parsing pod has CPU: *2 cores to 2 cores*, and Memory: *8 GiB to 10 GiB*.
>>>>> - I have 10 Tika parsing pod replicas.
>>>>>
>>>>> On each pod, I run a Java program with 8 parse threads.
>>>>>
>>>>> Each thread:
>>>>>
>>>>> - Starts a single Tika server process (in spawn child mode)
>>>>> - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000 -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500 -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures -enableFileUrl
>>>>> - The thread then continuously grabs a file from the files-to-fetch queue and sends it to the Tika server, stopping when there are no more files to parse.
>>>>>
>>>>> Each of these files is stored locally on the pod in a buffer, so the local-file optimization is used.
>>>>>
>>>>> The Tika web service it is using is:
>>>>>
>>>>> Endpoint: `/rmeta/text`
>>>>> Method: `PUT`
>>>>> Headers:
>>>>>   - writeLimit = 32000000
>>>>>   - maxEmbeddedResources = 0
>>>>>   - fileUrl = file:///path/to/file
>>>>>
>>>>> Files are no greater than 100 MB, and the maximum number of bytes of text Tika will write is the writeLimit, 32 MB.
>>>>>
>>>>> Each pod is parsing about 370,000 documents per day. I've been messing with a ton of different attempts at settings.
>>>>>
>>>>> I previously tried to use the actual Tika "ForkParser", but the performance was far worse than spawning Tika servers. So that is why I am using Tika Server.
>>>>>
>>>>> I don't hate the performance results of this...
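The `/rmeta/text` call described above can be sketched with JDK 11's built-in `java.net.http` client. The host, port (9998 is tika-server's default), and file path are illustrative, and the request body is empty because the `fileUrl` header together with `-enableFileUrl`/`-enableUnsecureFeatures` tells the server to read the local file itself:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RmetaClient {

    // Build the PUT request for /rmeta/text; separated out so the
    // header wiring is easy to see (and to test).
    static HttpRequest buildRmetaRequest(String serverUrl, String localPath) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverUrl + "/rmeta/text"))
                .header("Accept", "application/json")
                .header("writeLimit", "32000000")         // cap extracted text at 32 MB
                .header("maxEmbeddedResources", "0")      // skip embedded documents
                .header("fileUrl", "file://" + localPath) // local-file optimization
                .PUT(HttpRequest.BodyPublishers.noBody()) // server reads the file itself
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = buildRmetaRequest("http://localhost:9998", "/data/buffer/doc1.pdf");
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode());
        System.out.println(resp.body()); // /rmeta JSON: metadata plus extracted text
    }
}
```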
>>>>> But I feel like I'd better reach out and make sure there isn't someone out there who sanity-checks my numbers and says, "Whoa, that's awful performance, you should be getting xyz like me!"
>>>>>
>>>>> Is anyone doing anything similar? If so, what settings did you end up settling on?
>>>>>
>>>>> Also, I'm wondering if Apache HttpClient could be adding any overhead here when I call my Tika Server /rmeta/text endpoint. I am using a shared connection pool. Would there be any benefit in, say, using a unique HttpClients.createDefault() for each thread instead of sharing a connection pool between the threads?
>>>>>
>>>>> I cross-posted the question here as well:
>>>>> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
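On the connection-pool question: a single shared client is generally preferable to per-thread HttpClients.createDefault() instances, but the pool limits matter. Apache HttpClient 4.x defaults to 2 connections per route and 20 total; with 8 threads each talking to its own spawned server (so 8 distinct routes) that stays within the defaults, but setting the limits explicitly makes the behavior predictable if the topology ever changes. A sketch, with illustrative limits sized to the thread count:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SharedClientFactory {

    // One pooled client shared by all parse threads. The defaults
    // (2 per route, 20 total) would throttle many threads hitting a
    // single Tika server, so raise them explicitly to the thread count.
    static CloseableHttpClient newSharedClient(int threads) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setDefaultMaxPerRoute(threads);
        cm.setMaxTotal(threads);
        return HttpClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
```

Either way, client-side connection handling is unlikely to be the bottleneck here: the per-request pool overhead is microseconds against parse times measured in tens or hundreds of milliseconds.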