Hi all

As you may know, I run regression tests for any LT changes I make by
checking the huge Ukrainian media/book archives I've collected over
time. The full check of the 6 text files I use was taking more than an
hour on my i3, so I upgraded to an i7. That brought the time under 60
minutes, which was much better, but while LT was using 40% of the CPU
on the i3 (4 cores), it used only 30% on the i7 (8 cores), so I always
wanted to take another shot at our multithreaded logic.

I knew from previous research that collecting all the results from the
analyze and check callables was stalling the threads, so a lot of the
time in the threading code was spent waiting rather than running.
The process has 3 main parts: reading from the file and splitting it
into paragraphs (1 thread), analyzing sentences (multithreaded), and
checking the rules (multithreaded).

Ideally this process should be turned into a continuous flow, i.e. the
paragraph reader should feed (for example via streams or data flow)
into the analyze threads, and the analyze threads should feed into the
check threads, without stopping and syncing between batches. But that
requires a significant change to the code we have.
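Just to illustrate the kind of flow I mean, here's a minimal sketch
(this is not LT code; the stage names, queue sizes and the sentinel
object are all made up):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Minimal pipeline sketch: reader -> analyzer -> checker, connected by
    // bounded queues so no stage has to stop and wait for a whole batch.
    public class PipelineSketch {
        // unique sentinel object telling a consumer its producer is done
        static final String POISON = new String("<end>");

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> paragraphs = new ArrayBlockingQueue<>(1000);
            BlockingQueue<String> analyzed = new ArrayBlockingQueue<>(1000);

            Thread reader = new Thread(() -> {
                try {
                    for (String p : new String[] {"para one", "para two"}) {
                        paragraphs.put(p);          // stand-in for file reading
                    }
                    paragraphs.put(POISON);
                } catch (InterruptedException ignored) {}
            });

            Thread analyzer = new Thread(() -> {
                try {
                    String p;
                    while ((p = paragraphs.take()) != POISON) {
                        analyzed.put("analyzed " + p);  // stand-in for analysis
                    }
                    analyzed.put(POISON);
                } catch (InterruptedException ignored) {}
            });

            Thread checker = new Thread(() -> {
                try {
                    String s;
                    while ((s = analyzed.take()) != POISON) {
                        System.out.println("checked " + s);  // stand-in for rule checks
                    }
                } catch (InterruptedException ignored) {}
            });

            reader.start(); analyzer.start(); checker.start();
            reader.join(); analyzer.join(); checker.join();
        }
    }

In the real thing there would be several analyzer/checker threads per
stage, but the idea is the same: bounded queues instead of collecting
and syncing full batches.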

So I took a shot at using runOnFileInOneGo() on a 100MB text file I
have (currently that code path produces slightly different results
than runOnFileLineByLine(), but for benchmarking it didn't matter). I
have 24G of RAM, so memory would not be an issue. But that didn't
work: the test ran for much longer than I expected and I had to kill
the process (we may have some inefficiencies when feeding the rules
thousands of sentences in one go).

So I came at it from another angle and tried to analyze the run/wait
times. Here's what I found on a 50M file:

Time: 165962ms for 485161 sentences (2923.3 sentences/sec)
readTime: 32252, analyzeTime: 49078, checkTime: 87478, analyzeWait:
25947, checkWait: 75199, paragraphs: 72136
33% cpu load

Here readTime is the time spent reading and splitting the file,
analyzeTime/checkTime is the cumulative time spent analyzing and
checking in the threads, and analyzeWait/checkWait is the cumulative
time between the first and the last thread being done with the analyze
and check tasks. The *Wait time is not exactly "clean" wasted time (in
the real world you'll always have to wait), but it shows how much
earlier some threads finish than others.
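For reference, the measurement itself is just a few counters around
the tasks. Roughly, as a sketch (this instrumentation is mine and is
not in LT's code base):

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of the counters behind the numbers above (illustrative only).
    class PhaseStats {
        final AtomicLong workMs = new AtomicLong();  // -> analyzeTime/checkTime
        final AtomicLong firstDoneAt = new AtomicLong(Long.MAX_VALUE);
        final AtomicLong lastDoneAt = new AtomicLong(Long.MIN_VALUE);

        // wrap each analyze/check task submitted to the pool
        void record(Runnable task) {
            long start = System.currentTimeMillis();
            task.run();
            long end = System.currentTimeMillis();
            workMs.addAndGet(end - start);
            firstDoneAt.accumulateAndGet(end, Math::min);
            lastDoneAt.accumulateAndGet(end, Math::max);
        }

        // -> analyzeWait/checkWait for one batch; the reported numbers
        // are these gaps accumulated across all batches
        long waitMs() {
            return lastDoneAt.get() - firstDoneAt.get();
        }
    }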

Then I realized that in the check method we split the rules into
callables whose count equals the # of available cores (8 in my case).
As I have 347 rules, each bucket gets ~43 rules, and since the rules
are not equal in complexity, this can lead to quite unequal run times
per thread.
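In code terms, the bucketing is roughly this (a simplified sketch, not
the actual MultiThreadedJLanguageTool code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;

    // Sketch: split the rule list into `callableCount` buckets, one Callable
    // per bucket. With 347 rules and callableCount = 8, each bucket gets ~43
    // rules, so one bucket of expensive rules makes its thread the straggler.
    class RuleBucketing {
        static <R> List<Callable<Void>> splitRules(List<R> rules, int callableCount) {
            List<Callable<Void>> callables = new ArrayList<>();
            int bucketSize = (rules.size() + callableCount - 1) / callableCount; // round up
            for (int i = 0; i < rules.size(); i += bucketSize) {
                List<R> bucket = rules.subList(i, Math.min(i + bucketSize, rules.size()));
                callables.add(() -> {
                    for (R rule : bucket) {
                        // apply this rule to the analyzed sentences (omitted here)
                    }
                    return null;
                });
            }
            return callables;
        }
    }

Smaller buckets (a higher callableCount) let the executor rebalance
the stragglers across whichever threads are free.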

So I increased the size of the callable array to 10 * the # of threads
= 80 and got better results:
Time: 131925ms for 485161 sentences (3677.6 sentences/sec)
readTime: 30400, analyzeTime: 47572, checkTime: 53890, analyzeWait:
25720, checkWait: 23601, paragraphs: 72136, rules: 347
40% cpu load

Quite a speedup. Of course, the next thing I tried was # of callables
= # of rules, to push the granularity of the check callables to the
max. This led to a very small checkWait but seemed to increase the
synchronization costs, so performance went down a bit:
Time: 138432ms for 485161 sentences (3504.7 sentences/sec)
readTime: 26954, analyzeTime: 47162, checkTime: 64250, analyzeWait:
25990, checkWait: 12592, paragraphs: 72136, rules: 347

As there seems to be no magic generic formula, due to the different
types of texts and different rules, I ran several more benchmarks.
Performance peaked at approximately # of callables == # of rules / 10
(35 in my case):

Time: 131972ms for 485161 sentences (3676.2 sentences/sec)
readTime: 30125, analyzeTime: 48199, checkTime: 53570, analyzeWait:
26005, checkWait: 21788, paragraphs: 72136, rules: 347
45% cpu load

This gave a ~26% performance increase (with 12% more CPU usage) on the
50M book collection. I saw a similar effect on a 100M newspaper
archive file:
before (31% cpu load):
Time: 296837ms for 610505 sentences (2056.7 sentences/sec)
readTime: 33572, analyzeTime: 108632, checkTime: 161008, analyzeWait:
76633, checkWait: 137750, paragraphs: 161467

after (41% cpu load):
Time: 239431ms for 610505 sentences (2549.8 sentences/sec)
readTime: 35746, analyzeTime: 106457, checkTime: 97078, analyzeWait:
76840, checkWait: 41871, paragraphs: 161467, rules: 347

~24% increase (+10% cpu load)

I got a similar result for a small book (165 paragraphs) as well: a
23% speed increase (2244 sentences/sec vs 1825).

I think the speedup will depend on the type of text, the type/number
of rules, and the CPU. And as I don't have a scientific number for the
# of callables, I'd say we need to do some more benchmarking for other
languages/CPUs.
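If anyone wants to try, a minimal timing harness could look like this
(MultiThreadedJLanguageTool, sentenceTokenize(), check() and
shutdown() are the real LT API; the line-by-line feeding mimics
runOnFileLineByLine, and the callable count itself still has to be
patched in the source as shown below):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import org.languagetool.MultiThreadedJLanguageTool;
    import org.languagetool.language.Ukrainian;

    // Rough benchmark: feed a file to LT line by line, report sentences/sec.
    public class LtBenchmark {
        public static void main(String[] args) throws Exception {
            List<String> lines = Files.readAllLines(Paths.get(args[0]),
                                                    StandardCharsets.UTF_8);
            MultiThreadedJLanguageTool lt =
                new MultiThreadedJLanguageTool(new Ukrainian());
            long sentences = 0;
            long start = System.currentTimeMillis();
            for (String line : lines) {
                if (line.trim().isEmpty()) continue;
                sentences += lt.sentenceTokenize(line).size();
                lt.check(line);   // we only care about the timing here
            }
            long ms = System.currentTimeMillis() - start;
            System.out.printf("Time: %dms for %d sentences (%.1f sentences/sec)%n",
                              ms, sentences, sentences * 1000.0 / ms);
            lt.shutdown();        // stop the pool threads so the JVM can exit
        }
    }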

Based on what we find we can either:
1) hardcode the best formula we find
2) provide a system property so that users who care a lot about
performance (or want to experiment) can set their own value
3) combine both: provide a default based on the formula and allow
overriding it with the property (see the sketch after this list)
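For option 3, the change could be as small as something like this (the
property name "languagetool.checkCallables" is made up here, just to
illustrate the idea):

    // default to rules/10, allow an override via a system property
    // (the property name is hypothetical)
    int defaultCount = Math.max(1, allRules.size() / 10);
    final int threads = Integer.getInteger("languagetool.checkCallables", defaultCount);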

If you're willing to try this, you need to change line 185 in
MultiThreadedJLanguageTool
    final int threads = getThreadPoolSize();
e.g. to
    final int threads = Math.max(1, allRules.size() / 10);
(the Math.max() guard just avoids ending up with 0 callables for
languages with fewer than 10 rules).

Please let me know if you can confirm this, and then we can decide
which way to go from here.

Thanks,
Andriy

P.S. Sorry for the long email :). I was hoping this might help if
somebody wants to take a deeper dive into LT's parallel processing.
