I just ran two quick experiments. I tried disabling the CharsetDetector and just hardcoding UTF-8, and I also tried parallelizing DirectoryWalker so that all files are analyzed in parallel. (It works in my setup as long as SimpleXmlClaimReporter::report is declared `synchronized`.) Results:

Parallel, no CharsetDetector: 1.519s
Parallel, with CharsetDetector: 1.602s
Serial, no CharsetDetector: 3.329s
Serial, with CharsetDetector: 6.228s

I'm sure charset detection could be optimized somehow (for example, use `new CharsetDetector(256)` instead of `new CharsetDetector()`, which analyzes the first 12kB of every document), but clearly the single most effective optimization here is going to be parallel processing.

On Fri, Jan 16, 2026 at 1:47 PM Ryan Schmitt <[email protected]> wrote:

> I recently upgraded httpcomponents-core from apache-rat 0.12 to 0.17 and
> have seen an increase in RatCheckMojo runtime from 0.911 seconds to 7.448
> seconds, as can be seen by comparing these Develocity build reports:
>
> - Before:
> https://scans.gradle.com/s/kaqbflny4crsc/timeline?toggled=WyIzMiJd&view=by-type
> - After:
> https://scans.gradle.com/s/qzx3nwz6iyn2g/timeline?toggled=WyIxMCJd&view=by-type
>
> I tried `1.0.0-SNAPSHOT` but it was only about one second faster. I used
> async-profiler to grab a quick flame graph for 1.0.0-SNAPSHOT and I see a
> tremendous amount of time being spent in Tika charset detection (*not* MIME
> type detection -- it specifically looks like charset detection), along with
> lots of regex matching time of course (lots of which is for copyright
> scanning). Is there anything I can do on my end to speed this up? Is anyone
> working on parallelizing processing (RAT-340)? Can charset detection be
> optimized somehow?
>
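
For anyone who wants to reproduce the shape of the parallel experiment, here is a minimal standalone sketch of the idea. It is not RAT code: `ParallelWalkSketch`, `Reporter`, and `analyze` are hypothetical names, the "analysis" is just a character count, and the file is decoded as hardcoded UTF-8 rather than run through a charset detector. The only point is that files are processed on multiple threads while the shared reporter's `report` method is `synchronized`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelWalkSketch {

    /** Hypothetical stand-in for a reporter that is not thread-safe on its own. */
    static final class Reporter {
        private final List<String> lines = new ArrayList<>();

        // synchronized so that concurrent workers cannot corrupt the shared list
        synchronized void report(Path file, int length) {
            lines.add(file.getFileName() + ": " + length + " chars");
        }

        // Returns a copy so callers never see the list while it is being modified
        synchronized List<String> lines() {
            return new ArrayList<>(lines);
        }
    }

    /** Analyzes every regular file under root in parallel, assuming UTF-8. */
    static Reporter analyze(Path root) throws IOException {
        Reporter reporter = new Reporter();

        // Collect the file list serially first; only the per-file work runs in parallel.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        files.parallelStream().forEach(file -> {
            try {
                // Hardcoded UTF-8 decoding: no charset detection at all.
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                reporter.report(file, text.length());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return reporter;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        analyze(root).lines().forEach(System.out::println);
    }
}
```

Note that the report order is nondeterministic in this sketch, so a real implementation would probably need to sort or otherwise order the output if the report format depends on it.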
