Yes, tika-server is the way to go in the long run, as discussed in a recent thread on the users list. I hope I will have time in the future to migrate to it and get rid of the jar-hell problems for good...
On Thu, Nov 26, 2020 at 2:32 PM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I created a Tika fork example I want to add to the documentation as well:
> https://github.com/nddipiazza/tika-fork-parser-example
>
> When we submit your fixes, we should update this example with multi-threading.
>
> On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>
>> Hey Luis,
>>
>> It is related, because after your fixes I might be able to gain a significant performance advantage by switching to the fork parser. I would make great use of an example from someone who has set up a multi-threaded ForkParser processing program that can gracefully handle the huge onslaught that is my use case. But at this point I doubt I'll switch from Tika Server anyway, because I invested some time creating a wrapper around it and it is performing very well.
>>
>> On Wed, Nov 25, 2020 at 8:23 PM Luís Filipe Nassif <lfcnas...@gmail.com> wrote:
>>
>>> Not what you asked, but related :)
>>>
>>> Luis
>>>
>>> On Wed, Nov 25, 2020 at 11:20 PM, Luís Filipe Nassif <lfcnas...@gmail.com> wrote:
>>>
>>>> I've made a few improvements to ForkParser performance in an internal fork. I will try to contribute them upstream...
>>>>
>>>> On Mon, Nov 23, 2020 at 12:05 PM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:
>>>>
>>>>> I am attempting to parse dozens of millions of office documents with Tika: PDFs, docs, Excel files, XMLs, etc. A wide assortment of types.
>>>>>
>>>>> Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time accuracy is also pretty important. I hope to have less than 10% of the parsed documents fail. (And by fail I mean fail due to Tika stability, like a timeout while parsing; I do not mean fail due to the document itself.)
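A multi-threaded ForkParser setup of the kind asked about above might look roughly like this. This is a sketch, not anyone's actual code from the thread: the pool size of 8 mirrors the 8 parse threads mentioned later, the 32,000,000-character cap mirrors the writeLimit, and the file list comes from the command line for illustration. It assumes tika-core (and tika-parsers) 1.x on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserDemo {

    public static void main(String[] args) throws Exception {
        // One ForkParser shared by all threads; it maintains a pool of
        // forked JVMs, so a crash or hang in a parser cannot take down
        // this process.
        ForkParser parser = new ForkParser(ForkParserDemo.class.getClassLoader());
        parser.setPoolSize(8); // number of forked parser JVMs

        // Files-to-fetch queue, filled up front for this demo.
        BlockingQueue<Path> queue = new LinkedBlockingQueue<>();
        for (String f : args) {
            queue.add(Path.of(f));
        }

        ExecutorService workers = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            workers.submit(() -> {
                Path file;
                // Drain the queue; poll() returns null when it is empty.
                while ((file = queue.poll()) != null) {
                    try (InputStream in = Files.newInputStream(file)) {
                        // Cap extracted text, analogous to a writeLimit.
                        BodyContentHandler handler = new BodyContentHandler(32_000_000);
                        parser.parse(in, handler, new Metadata(), new ParseContext());
                    } catch (Exception e) {
                        // A crash/timeout in the forked JVM surfaces here;
                        // log it and move on to the next file.
                        System.err.println(file + ": " + e);
                    }
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        parser.close();
    }
}
```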
>>>>> My question: how do I configure Tika Server in a containerized environment to maximize throughput?
>>>>>
>>>>> My environment:
>>>>>
>>>>> - I am using OpenShift.
>>>>> - Each Tika parsing pod has CPU: *2 cores to 2 cores*, and Memory: *8 GiB to 10 GiB*.
>>>>> - I have 10 Tika parsing pod replicas.
>>>>>
>>>>> On each pod, I run a Java program with 8 parse threads.
>>>>>
>>>>> Each thread:
>>>>>
>>>>> - Starts a single Tika server process (in spawn child mode)
>>>>> - Tika server arguments: -s -spawnChild -maxChildStartupMillis 120000 -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500 -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures -enableFileUrl
>>>>> - The thread then continuously grabs a file from the files-to-fetch queue and sends it to the Tika server, stopping when there are no more files to parse.
>>>>>
>>>>> Each of these files is stored locally on the pod in a buffer, so the local-file optimization is used.
>>>>>
>>>>> The Tika web service it is using is:
>>>>>
>>>>> Endpoint: `/rmeta/text`
>>>>> Method: `PUT`
>>>>> Headers:
>>>>>   - writeLimit = 32000000
>>>>>   - maxEmbeddedResources = 0
>>>>>   - fileUrl = file:///path/to/file
>>>>>
>>>>> Files are no greater than 100 MB, and the maximum number of bytes of text Tika will write is the writeLimit, 32 MB.
>>>>>
>>>>> Each pod is parsing about 370,000 documents per day. I've been messing with a ton of different attempts at settings.
>>>>>
>>>>> I previously tried to use the actual Tika "ForkParser", but the performance was far worse than spawning Tika servers. So that is why I am using Tika Server.
>>>>>
>>>>> I don't hate the performance results of this...
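The `/rmeta/text` call described above can be sketched with JDK 11's built-in `java.net.http` client. The host, port (9998 is tika-server's default), and file path are illustrative, and the request body is empty because the `fileUrl` header together with `-enableFileUrl`/`-enableUnsecureFeatures` tells the server to read the local file itself:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RmetaClient {

    // Build the PUT request for /rmeta/text; separated out so the
    // header wiring is easy to see (and to test).
    static HttpRequest buildRmetaRequest(String serverUrl, String localPath) {
        return HttpRequest.newBuilder()
                .uri(URI.create(serverUrl + "/rmeta/text"))
                .header("Accept", "application/json")
                .header("writeLimit", "32000000")         // cap extracted text at 32 MB
                .header("maxEmbeddedResources", "0")      // skip embedded documents
                .header("fileUrl", "file://" + localPath) // local-file optimization
                .PUT(HttpRequest.BodyPublishers.noBody()) // server reads the file itself
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = buildRmetaRequest("http://localhost:9998", "/data/buffer/doc1.pdf");
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode());
        System.out.println(resp.body()); // /rmeta JSON: metadata plus extracted text
    }
}
```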
>>>>> But I feel like I'd better reach out and make sure there isn't someone out there who sanity-checks my numbers and says, "Whoa, that's awful performance, you should be getting xyz like me!"
>>>>>
>>>>> Is anyone doing anything similar? If so, what settings did you end up settling on?
>>>>>
>>>>> Also, I'm wondering if Apache HttpClient could be adding any overhead here when I call my Tika Server /rmeta/text endpoint. I am using a shared connection pool. Would there be any benefit in, say, using a unique HttpClients.createDefault() for each thread instead of sharing a connection pool between the threads?
>>>>>
>>>>> I cross-posted the question here as well:
>>>>> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
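On the connection-pool question: a single shared client is generally preferable to per-thread HttpClients.createDefault() instances, but the pool limits matter. Apache HttpClient 4.x defaults to 2 connections per route and 20 total; with 8 threads each talking to its own spawned server (so 8 distinct routes) that stays within the defaults, but setting the limits explicitly makes the behavior predictable if the topology ever changes. A sketch, with illustrative limits sized to the thread count:

```java
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class SharedClientFactory {

    // One pooled client shared by all parse threads. The defaults
    // (2 per route, 20 total) would throttle many threads hitting a
    // single Tika server, so raise them explicitly to the thread count.
    static CloseableHttpClient newSharedClient(int threads) {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setDefaultMaxPerRoute(threads);
        cm.setMaxTotal(threads);
        return HttpClients.custom()
                .setConnectionManager(cm)
                .build();
    }
}
```

Either way, client-side connection handling is unlikely to be the bottleneck here: the per-request pool overhead is microseconds against parse times measured in tens or hundreds of milliseconds.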