Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Luís Filipe Nassif
Yes, tika-server is the choice for the long run, as discussed in the recent
thread on the user list. I hope I will have time in the future to migrate to
it and finally get rid of the jar-hell problems...


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Nicholas DiPiazza
I created a Tika ForkParser example that I want to add to the documentation as well:
https://github.com/nddipiazza/tika-fork-parser-example

Once your fixes are submitted, we should update this example with
multi-threading.
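
For reference, here is a minimal single-document ForkParser sketch in plain
Java. It is illustrative only and is not claimed to match the linked
repository; the class name and file path are placeholders.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ForkParserSketch {
    public static void main(String[] args) throws Exception {
        // ForkParser does the actual parsing in a separate forked JVM,
        // isolating the caller from parser crashes and memory leaks.
        ForkParser parser = new ForkParser(
                ForkParserSketch.class.getClassLoader(), new AutoDetectParser());
        try (InputStream in = Files.newInputStream(Paths.get("/path/to/some.pdf"))) {
            // -1 disables the handler's internal write limit; use a positive
            // value to cap the amount of extracted text for very large files.
            BodyContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();
            parser.parse(in, handler, metadata, new ParseContext());
            System.out.println(metadata);
            System.out.println("extracted " + handler.toString().length() + " chars");
        } finally {
            parser.close(); // shuts down the forked JVM(s)
        }
    }
}
```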


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-26 Thread Nicholas DiPiazza
Hey Luis,

It is related because, after your fixes, I might be able to gain a significant
performance advantage by switching to the fork parser.
I would make great use of an example from someone who has set up a
multi-threaded ForkParser processing program that can gracefully handle the
huge onslaught that is my use case.
But at this point, I doubt I'll switch away from Tika Server anyway, because I
invested some time creating a wrapper around it and it is performing very
well.
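
Since a multi-threaded ForkParser example is being asked for here, below is a
rough sketch of one way it might look: a single shared ForkParser whose
fork-JVM pool size matches the number of worker threads, driven by an
ExecutorService. This is a hedged illustration rather than anyone's production
code; the class name and file list are placeholders standing in for the real
files-to-fetch queue, and error handling is kept minimal.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.tika.fork.ForkParser;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MultiThreadedForkParserSketch {

    public static void main(String[] args) throws Exception {
        final int threads = 8;

        // One shared ForkParser; its pool of forked JVMs serves concurrent
        // parse() calls, and callers wait while all forks are busy.
        final ForkParser parser = new ForkParser(
                MultiThreadedForkParserSketch.class.getClassLoader(), new AutoDetectParser());
        parser.setPoolSize(threads);

        // Placeholder input; in the real setup this would be the files-to-fetch queue.
        List<Path> files = Arrays.asList(Paths.get("/data/a.pdf"), Paths.get("/data/b.docx"));

        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            for (final Path file : files) {
                executor.submit(() -> {
                    try (InputStream in = Files.newInputStream(file)) {
                        BodyContentHandler handler = new BodyContentHandler(-1);
                        parser.parse(in, handler, new Metadata(), new ParseContext());
                        System.out.println(file + ": " + handler.toString().length() + " chars");
                    } catch (Exception e) {
                        // A crashed or hung fork only costs this one document.
                        System.err.println(file + " failed: " + e);
                    }
                });
            }
        } finally {
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
            parser.close();
        }
    }
}
```

The point of the fork pool is that a crashed, hung, or out-of-memory fork costs
only the document being parsed at that moment, which is roughly the same
isolation property the spawnChild tika-server setup provides.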


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
Not what you asked but related :)

Luis


Re: How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

2020-11-25 Thread Luís Filipe Nassif
I've made a few improvements to ForkParser performance in an internal fork.
Will try to contribute them upstream...

On Mon, Nov 23, 2020, 12:05, Nicholas DiPiazza <nicholas.dipia...@gmail.com> wrote:

> I am attempting to parse tens of millions of office documents with Tika:
> PDFs, DOCs, Excel files, XML, and so on. A wide assortment of types.
>
> Throughput is very important. I need to be able to parse these files in a
> reasonable amount of time, but at the same time, accuracy is also pretty
> important. I hope to have less than 10% of the parsed documents fail. (And
> by fail I mean fail due to Tika stability, like a timeout while parsing; I
> do not mean fail because of the document itself.)
>
> My question: how do I configure Tika Server in a containerized environment
> to maximize throughput?
>
> My environment:
>
>    - I am using OpenShift.
>    - Each Tika parsing pod has a CPU request/limit of 2 cores / 2 cores and
>      a memory request/limit of 8 GiB / 10 GiB.
>    - I have 10 Tika parsing pod replicas.
>
> On each pod, I run a Java program with 8 parse threads.
>
> Each thread:
>
>    - Starts a single tika-server process (in spawnChild mode)
>       - tika-server arguments: -s -spawnChild -maxChildStartupMillis 120000
>         -pingPulseMillis 500 -pingTimeoutMillis 30000 -taskPulseMillis 500
>         -taskTimeoutMillis 120000 -JXmx512m -enableUnsecureFeatures
>         -enableFileUrl
>    - Continuously grabs a file from the files-to-fetch queue and sends it to
>      the tika server, stopping when there are no more files to parse.
>
> Each of these files is stored locally on the pod in a buffer, so the local
> file optimization is used.
>
> The Tika web service call being made is:
>
> Endpoint: `/rmeta/text`
> Method: `PUT`
> Headers:
>    - writeLimit = 32000000
>    - maxEmbeddedResources = 0
>    - fileUrl = file:///path/to/file
>
> Files are no greater than 100 MB, and the maximum amount of Tika text output
> (the writeLimit) is 32 MB.
>
> Each pod is parsing about 370,000 documents per day. I've been experimenting
> with a ton of different settings.
>
> I previously tried to use the actual Tika ForkParser, but the performance
> was far worse than spawning tika servers, so that is why I am using Tika
> Server.
>
> I don't hate these performance results, but I figured I'd better reach out
> and make sure there isn't someone out there who would sanity-check my
> numbers and say, "whoa, that's awful performance, you should be getting
> xyz like me!"
>
> Is anyone out there doing something similar? If so, what settings did you
> end up settling on?
>
> Also, I'm wondering whether Apache HttpClient could be adding any overhead
> here when I call my Tika Server /rmeta/text endpoint. I am using a shared
> connection pool. Would there be any benefit in, say, using a unique
> HttpClients.createDefault() for each thread instead of sharing a connection
> pool between the threads?
>
>
> Cross-posted the question here as well:
>
> https://stackoverflow.com/questions/64950945/how-to-configure-apache-tika-in-a-kube-environment-to-obtain-maximum-throughput
>
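
On the HttpClient question at the end of the post: below is a rough sketch,
using Apache HttpClient 4.x, of the kind of /rmeta/text call described above,
with one client shared across threads and backed by an explicitly sized
PoolingHttpClientConnectionManager. The header names come from the question
itself; the class name, pool sizes, and server URL handling are assumptions
made for the sketch, not a recommendation.

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPut;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;

public class TikaRmetaClient {

    // One client shared by all parse threads, backed by an explicitly sized pool.
    private static final CloseableHttpClient HTTP;
    static {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(16);           // at least the number of parse threads on the pod
        cm.setDefaultMaxPerRoute(4);  // each thread talks to its own tika-server port/route
        HTTP = HttpClients.custom().setConnectionManager(cm).build();
    }

    /** PUTs a reference to a locally buffered file to tika-server's /rmeta/text endpoint. */
    public static String parse(String tikaServerUrl, String localPath) throws Exception {
        HttpPut put = new HttpPut(tikaServerUrl + "/rmeta/text");
        // Header names as described in the question above.
        put.setHeader("fileUrl", "file://" + localPath);
        put.setHeader("writeLimit", "32000000");      // 32 MB text limit, as in the question
        put.setHeader("maxEmbeddedResources", "0");   // skip embedded documents
        try (CloseableHttpResponse response = HTTP.execute(put)) {
            if (response.getStatusLine().getStatusCode() != 200) {
                throw new IllegalStateException("Parse failed: " + response.getStatusLine());
            }
            return EntityUtils.toString(response.getEntity());
        }
    }
}
```

Since each parse thread here talks to its own tika-server instance (its own
port, and therefore its own route), the pool's per-route limit is unlikely to
be the constraint; what matters is that maxTotal is at least the number of
parse threads and that socket/connect timeouts are set sensibly. A per-thread
HttpClients.createDefault() would most likely perform about the same, so the
shared pool is probably not a meaningful source of overhead.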