Hey Luis, it is related, because after your fixes I might be able to gain a significant performance advantage by switching to ForkParser. I would get a lot out of an example from someone who has set up a multi-threaded ForkParser processing program that gracefully handles the huge onslaught that is my use case. But at this point I doubt I'll switch from Tika Server anyway, because I invested some time creating a wrapper around it and it is performing very well.

On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif <lfcnas...@gmail.com> wrote:

> Not what you asked but related :)
>
> Luis
>
> On Wed, Nov 25, 2020 at 23:20, Luís Filipe Nassif <lfcnas...@gmail.com> wrote:
>
> > I've done some few improvements in ForkParser performance in an internal
> > fork. Will try to contribute upstream...
> >
> > On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
> >
> >> I am attempting to Tika parse dozens of millions of office documents.
> >> Pdfs, docs, excels, xmls, etc. Wide assortment of types.
> >>
> >> Throughput is very important. I need to be able parse these files in a
> >> reasonable amount of time, but at the same time, accuracy is also pretty
> >> important. I hope to have less than 10% of the documents parsed fail. (And
> >> by fail I mean fail due to tika stability, like a timeout while parsing. I
> >> do not mean fail due to the document itself).
> >>
> >> My question - how to configure Tika Server in a containerized environment
> >> to maximize throughput?
> >>
> >> My environment:
> >>
> >> - I am using Openshift.
> >> - Each tika parsing pod has *CPU: 2 cores to 2 cores*, and Memory: *8 GiB to 10 GiB*.
> >> - I have 10 tika parsing pod replicas.
> >>
> >> On each pod, I run a java program where I have 8 parse threads.
> >>
> >> Each thread:
> >>
> >> - Starts a single tika server process (in spawn child mode)
> >>   - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000 -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500 -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures -enableFileUrl
> >> - The thread will now continuously grab a file from the files-to-fetch
> >>   queue and will send it to the tika server, stopping when there are no
> >>   more files to parse.
> >>
> >> Each of these files are stored locally on the pod in a buffer, so the
> >> local file optimization is used:
> >>
> >> The Tika web service it is using is:
> >>
> >> Endpoint: `/rmeta/text`
> >> Method: `PUT`
> >> Headers:
> >> - writeLimit = 32000000
> >> - maxEmbeddedResources = 0
> >> - fileUrl = file:///path/to/file
> >>
> >> Files are no greater than 100Mb, the maximum number of bytes tika text
> >> will be (writeLimit) 32Mb.
> >>
> >> Each pod is parsing about 370,000 documents per day. I've been messing
> >> with a ton of different attempts at settings.
> >>
> >> I previously tried to use the actual Tika "ForkParser" but the performance
> >> was far worse than spawning tika servers. So that is why I am using Tika
> >> Server.
> >>
> >> I don't hate the performance results of this.... but I feel like I'd
> >> better reach out and make sure there isn't someone out there who sanity
> >> checks my numbers and is like "woah that's awful performance, you should
> >> be getting xyz like me!"
> >>
> >> Anyone have any similar things you are doing? If so, what settings did you
> >> end up settling on?
> >>
> >> Also, I'm wondering if Apache Http Client would be causing any overhead
> >> here when I am calling to my Tika Server /rmeta/text endpoint. I am using
> >> a shared connection pool. Would there be any benefit in say using a unique
> >> HttpClients.createDefault() for each thread instead of sharing a
> >> connection pool between the threads?
> >>
> >> Cross posted question here as well
> >>
> >> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
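
For anyone who wants to picture the per-thread setup quoted above, here is a rough sketch of it, not the actual wrapper. It uses the JDK's built-in java.net.http client (Java 11+) in place of Apache Http Client so that it needs no extra dependencies. The class name, the jar path, the port numbers and the `-p` port argument are assumptions added for illustration; the remaining server arguments, the endpoint and the three headers are the ones listed in the original message.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class TikaWorkerSketch {

    // Command line for one Tika Server process per parse thread.
    // jarPath and port are hypothetical; "-p" is assumed to select the port.
    // Everything after it is copied from the arguments quoted in the thread.
    static List<String> serverCommand(String jarPath, int port) {
        return List.of("java", "-jar", jarPath,
                "-p", Integer.toString(port),
                "-s", "-spawnChild",
                "-maxChildStartupMillis", "120000",
                "-pingPulseMillis", "500",
                "-pingTimeoutMillis", "30000",
                "-taskPulseMillis", "500",
                "-taskTimeoutMillis", "120000",
                "-JXmx512m",
                "-enableUnsecureFeatures", "-enableFileUrl");
    }

    // PUT /rmeta/text with an empty body: the server is pointed at the
    // local file through the fileUrl header instead of receiving the bytes.
    static HttpRequest rmetaRequest(int port, Path localFile) {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:" + port + "/rmeta/text"))
                .header("writeLimit", "32000000")
                .header("maxEmbeddedResources", "0")
                .header("fileUrl", localFile.toUri().toString())
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // One parse thread: take files from the shared queue until it is empty.
    // The same HttpClient instance is passed to every worker here.
    static void drain(HttpClient client, int port, BlockingQueue<Path> files)
            throws Exception {
        Path next;
        while ((next = files.poll()) != null) {
            HttpResponse<String> response = client.send(
                    rmetaRequest(port, next),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println(next + " -> HTTP " + response.statusCode());
        }
    }
}
```

Each of the 8 threads would start its own server with something like `new ProcessBuilder(TikaWorkerSketch.serverCommand(jarPath, port)).inheritIO().start()` on a distinct port, then call `drain` with the shared client and queue. Whether a shared client or one client per thread is faster is exactly the open question above, so the sketch takes no position on it beyond showing the shared variant.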