Yes, tika-server is the long-term choice, as discussed in a recent thread on
the user list. I hope to have time in the future to migrate to it and get rid
of jar hell problems for good...
On Thu, Nov 26, 2020 at 14:32, Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:
I created a tika fork example I want to add to the documentation as well:
https://github.com/nddipiazza/tika-fork-parser-example
When we submit your fixes, we should update this example to use
multi-threading.
On Thu, Nov 26, 2020 at 11:28 AM Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:
Hey Luis,
It is related, because after your fixes I might be able to gain a
significant performance advantage by switching to ForkParser.
I would make great use of an example from someone else who has set up a
multi-threaded ForkParser processing program that can gracefully handle
the huge ons
Not what you asked but related :)
Luis
On Wed, Nov 25, 2020 at 23:20, Luís Filipe Nassif
wrote:
I've made a few improvements to ForkParser performance in an internal
fork. I will try to contribute them upstream...
On Mon, Nov 23, 2020 at 12:05, Nicholas DiPiazza <
nicholas.dipia...@gmail.com> wrote:
I am attempting to parse tens of millions of office documents with Tika: PDFs,
Word documents, Excel spreadsheets, XML files, etc. A wide assortment of types.
Throughput is very important. I need to be able to parse these files in a
reasonable amount of time, but at the same time, accuracy is also pretty
important. I hope to have l