Hi Ryan,

On 17.01.26 at 04:52, Ryan Schmitt wrote:
> I tried `1.0.0-SNAPSHOT` but it was only about one second faster. I used
> async-profiler to grab a quick flame graph for 1.0.0-SNAPSHOT and I see a
> tremendous amount of time being spent in Tika charset detection (*not* MIME
> type detection -- it specifically looks like charset detection), along with
> lots of regex matching time of course (lots of which is for copyright
> scanning). Is there anything I can do on my end to speed this up? Is anyone
> working on parallelizing processing (RAT-340)? Can charset detection be
> optimized somehow?

Thanks for diving into the regression. At the moment we are working on a 0.18 bugfix release, and in the background we are changing the whole module structure and architecture of RAT (this will become 1.0.0).

As we introduced Tika to detect files, we cannot gauge whether the problem lies within Tika itself or in the way RAT uses Tika to detect charsets and MIME types.

Feel free to create PRs via GitHub for any small improvements you see :)

Cheers,
Phil
