I just ran two quick experiments. I tried disabling the CharsetDetector and
just hardcoding UTF-8, and I also tried parallelizing DirectoryWalker so
that all files are analyzed in parallel. (It works in my setup as long
as SimpleXmlClaimReporter::report is declared `synchronized`.) Results:

Parallel, no CharsetDetector: 1.519s
Parallel, with CharsetDetector: 1.602s
Serial, no CharsetDetector: 3.329s
Serial, with CharsetDetector: 6.228s

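I'm sure charset detection could be optimized somehow (for example, by using
`new CharsetDetector(256)` instead of `new CharsetDetector()`, which
analyzes the first 12 kB of every document), but clearly the single most
effective optimization here is going to be parallel processing: with
charset detection left on, going parallel cuts the run from 6.228s to
1.602s (about 3.9x), while disabling detection in the serial run only
gets it down to 3.329s (about 1.9x).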
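
For anyone who wants to see the shape of the experiment, here is a minimal
sketch using only the JDK. It is not the actual RAT code: the class
`ParallelWalkSketch`, the nested `Reporter`, and the `analyze` and `walk`
methods are all hypothetical stand-ins I made up for illustration. It only
shows the two ideas above: files are read as hardcoded UTF-8, and they are
analyzed in parallel while the shared reporter stays safe because its
`report` method is `synchronized`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelWalkSketch {

    /** Hypothetical stand-in for a reporter that is not thread-safe by itself. */
    static final class Reporter {
        private final List<String> lines = new ArrayList<>();

        // synchronized so that concurrent workers cannot corrupt the list
        synchronized void report(Path file, int length) {
            lines.add(file.getFileName() + ": " + length);
        }

        // returns a copy so callers never see the list mid-update
        synchronized List<String> lines() {
            return new ArrayList<>(lines);
        }
    }

    /** Hypothetical per-file analysis: decode as UTF-8, no charset detection. */
    static void analyze(Path file, Reporter reporter) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            reporter.report(file, text.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Collects all regular files under root, then analyzes them in parallel. */
    static Reporter walk(Path root) throws IOException {
        Reporter reporter = new Reporter();
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        // Parallel stream on the common fork-join pool; report order is not deterministic.
        files.parallelStream().forEach(f -> analyze(f, reporter));
        return reporter;
    }
}
```

Calling `ParallelWalkSketch.walk(somePath)` returns the reporter with one
line per file. Note that the lines arrive in whatever order the workers
finish, so a real reporter that cares about output order would need to
sort or buffer.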

On Fri, Jan 16, 2026 at 1:47 PM Ryan Schmitt <[email protected]> wrote:

> I recently upgraded httpcomponents-core from apache-rat 0.12 to 0.17 and
> have seen an increase in RatCheckMojo runtime from 0.911 seconds to 7.448
> seconds, as can be seen by comparing these Develocity build reports:
>
> - Before:
> https://scans.gradle.com/s/kaqbflny4crsc/timeline?toggled=WyIzMiJd&view=by-type
> - After:
> https://scans.gradle.com/s/qzx3nwz6iyn2g/timeline?toggled=WyIxMCJd&view=by-type
>
> I tried `1.0.0-SNAPSHOT` but it was only about one second faster. I used
> async-profiler to grab a quick flame graph for 1.0.0-SNAPSHOT and I see a
> tremendous amount of time being spent in Tika charset detection (*not* MIME
> type detection -- it specifically looks like charset detection), along with
> lots of regex matching time of course (lots of which is for copyright
> scanning). Is there anything I can do on my end to speed this up? Is anyone
> working on parallelizing processing (RAT-340)? Can charset detection be
> optimized somehow?
>
