Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?
Yes, tika-server is the choice for the long run, as discussed in a recent thread on the users list. I hope I will have time in the future to migrate to it and finally get rid of the jar-hell problems...

On Thu, Nov 26, 2020 at 2:32 PM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> [quoted text omitted]
Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?
I created a tika fork example I want to add to the documentation as well:
https://github.com/nddipiazza/tika-fork-parser-example

When we submit your fixes, we should update this example with multi-threading.

On Thu, Nov 26, 2020 at 11:28 AM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> [quoted text omitted]
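The threading model discussed in this thread — a fixed pool of worker threads draining a shared files-to-fetch queue, each handing files to a parser — can be sketched with JDK primitives alone. This is an illustrative pattern, not code from the linked example; the `parse` callback is a stand-in for whatever actually invokes Tika (a `ForkParser.parse(...)` call or an HTTP request to tika-server), and all class and method names here are hypothetical.

```java
import java.util.Queue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ParseWorkerPool {
    /** Drains the queue with nThreads workers; each file path is passed to parse. */
    public static void run(Queue<String> filesToFetch, int nThreads,
                           Consumer<String> parse) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int i = 0; i < nThreads; i++) {
            pool.submit(() -> {
                String path;
                // Each worker keeps pulling work until the queue is empty.
                while ((path = filesToFetch.poll()) != null) {
                    try {
                        parse.accept(path);
                    } catch (RuntimeException e) {
                        // One failed document (e.g. a parse timeout) must not kill the worker.
                        System.err.println("parse failed for " + path + ": " + e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The try/catch around the per-document call is the important part for this use case: it is what lets a thread survive the occasional stability failure and move on to the next file.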
Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?
Hey Luis,

It is related, because after your fixes I might be able to gain a significant performance advantage by switching to the fork parser. I would make great use of an example from someone else who has set up a multi-threaded ForkParser processing program that can gracefully handle the huge onslaught that is my use case. But at this point I doubt I'll switch away from Tika Server anyway, because I invested some time creating a wrapper around it and it is performing very well.

On Wed, Nov 25, 2020 at 8:23 PM, Luís Filipe Nassif wrote:

> [quoted text omitted]
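A "wrapper around Tika Server" of the kind mentioned here might look roughly like the following: a thin client that PUTs a `fileUrl` header to `/rmeta/text` and retries when the spawned child is mid-restart. This is a hypothetical sketch using the JDK 11 `java.net.http` client; the class name, retry policy, and base URL are my assumptions, not details of the actual wrapper.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical thin wrapper around a running tika-server instance. */
public class TikaServerWrapper {
    private final HttpClient client = HttpClient.newHttpClient();
    private final String baseUrl;
    private final int maxAttempts;

    public TikaServerWrapper(String baseUrl, int maxAttempts) {
        this.baseUrl = baseUrl;
        this.maxAttempts = maxAttempts;
    }

    /** PUTs to /rmeta/text with a fileUrl header, retrying on transient failure. */
    public String rmetaText(String fileUrl) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta/text"))
                .header("fileUrl", fileUrl)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                HttpResponse<String> resp =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                if (resp.statusCode() == 200) {
                    return resp.body();
                }
                last = new IOException("HTTP " + resp.statusCode());
            } catch (IOException e) {
                last = e; // e.g. the spawned child is restarting; try again
            }
        }
        throw last;
    }
}
```

In spawn-child mode the server process gets killed and restarted on ping/task timeouts, so a small retry loop like this is one plausible way to make the client side "perform very well" despite restarts.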
Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?
Not what you asked, but related :)

Luis

On Wed, Nov 25, 2020 at 11:20 PM, Luís Filipe Nassif wrote:

> [quoted text omitted]
Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?
I've made a few improvements to ForkParser performance in an internal fork. Will try to contribute them upstream...

On Mon, Nov 23, 2020 at 12:05 PM, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I am attempting to Tika-parse tens of millions of office documents:
> PDFs, DOCs, Excel files, XMLs, etc. A wide assortment of types.
>
> Throughput is very important. I need to be able to parse these files in a
> reasonable amount of time, but at the same time accuracy is also pretty
> important. I hope to have less than 10% of the parsed documents fail. (And
> by "fail" I mean fail due to Tika stability, like a timeout while parsing;
> I do not mean fail due to the document itself.)
>
> My question: how do I configure Tika Server in a containerized environment
> to maximize throughput?
>
> My environment:
>
> - I am using OpenShift.
> - Each Tika parsing pod has CPU: 2 cores to 2 cores, and memory: 8 GiB to
>   10 GiB.
> - I have 10 Tika parsing pod replicas.
>
> On each pod, I run a Java program with 8 parse threads.
>
> Each thread:
>
> - Starts a single Tika server process (in spawn-child mode) with the
>   arguments: -s -spawnChild -maxChildStartupMillis 12 -pingPulseMillis 500
>   -pingTimeoutMillis 3 -taskPulseMillis 500 -taskTimeoutMillis 12
>   -JXmx512m -enableUnsecureFeatures -enableFileUrl
> - Continuously grabs a file from the files-to-fetch queue and sends it to
>   the Tika server, stopping when there are no more files to parse.
>
> Each of these files is stored locally on the pod in a buffer, so the local
> file optimization is used.
>
> The Tika web service call is:
>
> Endpoint: `/rmeta/text`
> Method: `PUT`
> Headers:
> - writeLimit = 3200
> - maxEmbeddedResources = 0
> - fileUrl = file:///path/to/file
>
> Files are no greater than 100 MB, and the maximum number of bytes of Tika
> text will be (writeLimit) 32 MB.
>
> Each pod is parsing about 370,000 documents per day. I've been messing
> with a ton of different attempts at settings.
>
> I previously tried to use the actual Tika ForkParser, but the performance
> was far worse than spawning Tika servers. That is why I am using Tika
> Server.
>
> I don't hate these performance results, but I feel like I'd better reach
> out and make sure there isn't someone out there who would sanity-check my
> numbers and say, "whoa, that's awful performance, you should be getting
> xyz like me!"
>
> Is anyone doing anything similar? If so, what settings did you end up
> settling on?
>
> Also, I'm wondering whether Apache HttpClient could be adding any overhead
> when I call my Tika Server /rmeta/text endpoint. I am using a shared
> connection pool. Would there be any benefit in, say, using a unique
> HttpClients.createDefault() for each thread instead of sharing a
> connection pool between the threads?
>
> I cross-posted the question here as well:
> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
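For concreteness, the `/rmeta/text` call the question describes can be assembled with the JDK 11 `java.net.http` API as follows. The header names and example values (`writeLimit`, `maxEmbeddedResources`, `fileUrl`) come straight from the message; the base URL is an assumption (tika-server listens on port 9998 by default).

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RmetaRequest {
    /** Builds the PUT /rmeta/text request described in the thread. */
    public static HttpRequest build(String baseUrl, String fileUrl,
                                    long writeLimit, int maxEmbeddedResources) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta/text"))
                .header("writeLimit", Long.toString(writeLimit))
                .header("maxEmbeddedResources", Integer.toString(maxEmbeddedResources))
                // fileUrl points the server at the locally buffered file,
                // so no document bytes travel over the HTTP request body.
                .header("fileUrl", fileUrl)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

On the connection-pool question: `java.net.http.HttpClient` instances are immutable and safe to share across threads, so one shared client per pod is the simpler choice. If the Apache client is kept, note that in HttpClient 4.x the pooling connection manager defaults to only 2 connections per route, so a shared pool should be sized (via `setDefaultMaxPerRoute`) to at least the thread count, or threads will block waiting for a connection; per-thread clients mainly buy isolation, not throughput.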