All,
As Brian pointed out, optimaize is no longer maintained, and it has
some dependencies that have aged out. Should we replace our baseline
langdetect in tika-app and tika-server in 3.x?
I'd say that we should go with our OpenNLP based language detection,
but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
Java 17.
Thoughts?
Best,
Tim
---------- Forwarded message ---------
From: Brian Laskey <[email protected]>
Date: Fri, Mar 8, 2024 at 2:38 PM
Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
and parsers
To: [email protected] <[email protected]>
Hi Tim
Thanks, this is helpful.
For tika-app we found the dependency on org.apache.tika »
tika-langdetect-optimaize brings in some older 3rd party jars, and
unfortunately it appears that the com.optimaize.languagedetector »
language-detector 0.6 is unmaintained, so its dependencies on
vulnerable versions of guava (18.0) cause us problems with security
scans. I could be wrong but I don’t believe we need this component for
our usage of just detect and parse?
We have a sort of microservice process (java based) which is ingesting
files parsed from tika. It was nice that we could separate the tika
process in its own heap space as a separate java process rather than
adding it to our app, but I suppose we could work around that.
Thank you
Brian Laskey
From: Tim Allison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, March 8, 2024 at 9:44 AM
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
tiki-core / and parsers
Hi Brian,
A few thoughts:
1) tika-app is basically tika-core + tika-parsers-standard-package.
Which components are you trying to avoid? tika-serialization and
jackson? boilerpipecontenthandler and some of its dependencies? I ask,
because we could factor out a tika-app-core with no parsers in Tika
3.x, which is what we do now with tika-server-core and
tika-server-standard.
2) Unrelated, there are probably more efficient ways of running Tika
than calling it per file on the commandline. That is a robust option,
at least!
If all you want is detect and text extraction, and you want to run it
from the commandline, write two classes, whose main()s call:
System.out.println(new Tika().detect(new File(args[0])));
or
System.out.println(new Tika().parseToString(new File(args[0])));
On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <[email protected]> wrote:
Hello Tika community,
Our team is migrating away from usage of tika-app.jar (2.6 currently)
to something with more minimal third party dependencies which we can
control.
Is there any good documentation or pathway to describe how a team
could map the tika-app functionality we use to the same behavior using
just tika-core and tika-parsers-standard-package
(I assume)?
The tika-app functions we use today are:
Mime-type detection
java -jar tika-app.jar -d <file>
and
Text extraction attempts
java -jar tika-app.jar -t <file>
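For reference, a minimal sketch of how a Java caller might drive the two invocations above in a child JVM, so that Tika keeps its own heap. It uses only the JDK (Java 9+). The class name TikaAppRunner and its helper methods are hypothetical, and the jar path and input file are assumed to arrive as program arguments. Only the -d and -t flags come from the commands listed above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class TikaAppRunner {

    // Hypothetical helper: builds the argument list for one tika-app call,
    // e.g. java -jar tika-app.jar -d <file>
    static List<String> command(String jarPath, String flag, String file) {
        return List.of("java", "-jar", jarPath, flag, file);
    }

    // Runs tika-app in a separate JVM and returns what it wrote to stdout.
    // stderr is passed through to the parent so it cannot fill up and block.
    static String run(String jarPath, String flag, String file)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(jarPath, flag, file))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String output = new String(
                process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tika-app exited with status " + exitCode);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        String jarPath = args[0]; // assumed: path to tika-app.jar
        String file = args[1];    // assumed: file to inspect
        System.out.print(run(jarPath, "-d", file)); // mime-type detection
        System.out.print(run(jarPath, "-t", file)); // text extraction
    }
}
```

Each call starts a new JVM, so this trades throughput for isolation.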
Is there a subset of tika parser jars we would need to include to have
equivalent functionality if we wrote our own wrapper main class?
Thank you,
Brian Laskey