Hi all,

As you may know, I run regression tests for any LT changes I make by checking large Ukrainian media/book archives I have collected over time. The full check over the 6 text files took more than an hour on my i3, so I upgraded to an i7. That brought the time under 60 minutes, which was much better, but while LT was using 40% of the CPU on the i3 (4 cores), it used only 30% on the i7 (8 cores), so I'd always wanted to take another shot at our multithreaded logic.
I knew from previous research that collecting all the results from the analyze and check callables was stalling the threads, so a lot of the time in the threading code was spent waiting rather than running. We have 3 main parts in the process: reading from the file and splitting into paragraphs (1 thread), analyzing sentences (multithreaded), and checking the rules (multithreaded). Ideally this should become a constant flow, i.e. reading paragraphs should feed (for example via streams or data flow) into the analyze threads, and the analyze threads should feed into the check threads, without stopping and syncing. But that requires a significant change to the code we have (there's a rough sketch of the idea in the P.P.S. below).

So I took a shot at using runOnFileInOneGo() on a 100MB text file I have (currently that code path produces slightly different results than runOnFileLineByLine(), but for benchmarking that didn't matter). I have 24G of RAM, so memory would not be an issue. But that didn't work: the test ran much longer than I expected and I had to kill the process (we may have some inefficiencies when we feed the rules thousands of sentences in one go).

So I came at it from another angle and tried to measure the run/wait time. Here's what I found on a 50M file:

  Time: 165962ms for 485161 sentences (2923.3 sentences/sec)
  readTime: 32252, analyzeTime: 49078, checkTime: 87478, analyzeWait: 25947, checkWait: 75199, paragraphs: 72136
  33% CPU load

Here readTime is the time spent reading and splitting the file, analyzeTime/checkTime are the cumulative times spent analyzing and checking in the threads, and analyzeWait/checkWait are the cumulative times between the first and the last thread being done with the analyze and check tasks. The *Wait numbers are not exactly "clean" wasted time (in the real world you'll always have to wait a bit), but they show how much earlier some threads finish than others.

Then I realized that in the check method we split the rules into callables whose count is the number of cores available (8 in my case). As I have 347 rules, each bucket gets ~43 rules, and since rules are not equal in complexity, this can lead to quite unequal run times per thread. So I increased the size of the callable array to several times the thread count (80 in my test) and got better results:

  Time: 131925ms for 485161 sentences (3677.6 sentences/sec)
  readTime: 30400, analyzeTime: 47572, checkTime: 53890, analyzeWait: 25720, checkWait: 23601, paragraphs: 72136, rules: 347
  40% CPU load

Quite a speedup. The next thing I tried, of course, was # of callables = # of rules, to push the granularity of the check callables to the maximum. This led to a very small checkWait but seemed to increase the synchronization costs, so performance went down a bit:

  Time: 138432ms for 485161 sentences (3504.7 sentences/sec)
  readTime: 26954, analyzeTime: 47162, checkTime: 64250, analyzeWait: 25990, checkWait: 12592, paragraphs: 72136, rules: 347

As there seems to be no magic generic formula (the best value depends on the texts and the rules), I ran several more benchmarks. Performance peaked at approximately # of callables == # of rules / 10 (35 in my case):

  Time: 131972ms for 485161 sentences (3676.2 sentences/sec)
  readTime: 30125, analyzeTime: 48199, checkTime: 53570, analyzeWait: 26005, checkWait: 21788, paragraphs: 72136, rules: 347
  45% CPU load

That's a ~26% performance increase (with 12% more CPU usage) on the 50M book collection.
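To make the bucketing concrete, here's the shape of the check step as I understand it. This is a paraphrase, not the literal LT source: checkWithBucket() is a stand-in for the actual per-bucket rule run, analyzedSentences is the batch being checked, and threadPool is the existing executor.

  // Paraphrased sketch of the check step, not the literal LT source.
  // callableCount is the knob I was tuning: LT currently uses
  // getThreadPoolSize() (= number of cores); 80 callables and
  // allRules.size() / 10 both beat that in my runs, while one callable
  // per rule (maximum granularity) was slightly slower again.
  int callableCount = Math.max(1, allRules.size() / 10);
  List<List<Rule>> buckets = new ArrayList<>();
  for (int i = 0; i < callableCount; i++) {
    buckets.add(new ArrayList<Rule>());
  }
  // round-robin split, so cheap and expensive rules spread across buckets
  for (int i = 0; i < allRules.size(); i++) {
    buckets.get(i % callableCount).add(allRules.get(i));
  }
  List<Callable<List<RuleMatch>>> callables = new ArrayList<>();
  for (List<Rule> bucket : buckets) {
    // checkWithBucket() is a stand-in for the real per-bucket check
    callables.add(() -> checkWithBucket(bucket, analyzedSentences));
  }
  // invokeAll() waits for the slowest callable: with only 8 buckets of
  // ~43 rules each, one expensive bucket stalls the other 7 threads,
  // which is exactly what the big checkWait numbers were showing
  List<Future<List<RuleMatch>>> results = threadPool.invokeAll(callables);

The granularity is the whole point: smaller buckets let the pool pick up work from behind the slow ones, at the price of more scheduling and result-collection overhead, which is why one rule per callable went down again.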
I saw a similar effect on a 100M newspaper archive file.

Before (31% CPU load):

  Time: 296837ms for 610505 sentences (2056.7 sentences/sec)
  readTime: 33572, analyzeTime: 108632, checkTime: 161008, analyzeWait: 76633, checkWait: 137750, paragraphs: 161467

After (41% CPU load):

  Time: 239431ms for 610505 sentences (2549.8 sentences/sec)
  readTime: 35746, analyzeTime: 106457, checkTime: 97078, analyzeWait: 76840, checkWait: 41871, paragraphs: 161467, rules: 347

That's a ~24% increase (+10% CPU load). I got a similar result for a small book (165 paragraphs) as well: a 23% speed increase (2244 sentences/sec vs 1825).

I think the speedup will depend on the type of texts, the type and number of rules, and the CPU. And as I don't have a scientific number for the # of callables, I'd say we need to do some more benchmarking for other languages/CPUs. Based on what we find we can either:

1) hardcode the best formula we find
2) provide a system property so that users who care a lot about performance (or want to experiment) can set their own value
3) combine both: provide a default based on the formula and allow overriding it with the property (see the sketch in the P.P.S.)

If you're willing to try this, change line 185 in MultiThreadedJLanguageTool from

  final int threads = getThreadPoolSize();

to e.g.

  final int threads = allRules.size() / 10;

Please let me know if you can confirm this, and we can decide where to go from here.

Thanks,
Andriy

P.S. Sorry for the long email :) I was hoping this may help if somebody wants to take a deeper dive into LT's parallel processing.
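P.P.S. To make option 3 concrete, here's roughly what I have in mind. This is just a sketch: the property name is something I made up, and the default formula is up for debate.

  // Option 3 sketch: "languagetool.checkCallables" is an invented name.
  // Users could override the default with:
  //   java -Dlanguagetool.checkCallables=35 ...
  int defaultCount = Math.max(getThreadPoolSize(), allRules.size() / 10);
  int callableCount = Integer.getInteger("languagetool.checkCallables", defaultCount);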
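And since the P.S. promised a deeper dive, here's roughly what the "constant flow" idea from the top of this mail could look like. Nothing like this exists in LT today: readParagraphs() and analyzeParagraph() are made-up stand-ins for the current read/analyze code, and ioPool/workerPool/analyzeThreads are plain ExecutorService plumbing I'm assuming. The key point is the bounded queues between the stages, so no stage ever waits for a whole batch to finish.

  // Hypothetical wiring, not current LT code. Bounded queues connect
  // the stages so sentences flow through continuously instead of being
  // collected and synced batch by batch.
  BlockingQueue<String> paragraphQueue = new ArrayBlockingQueue<>(1000);
  BlockingQueue<AnalyzedSentence> sentenceQueue = new ArrayBlockingQueue<>(10000);

  // Stage 1, single thread: read the file and feed paragraphs as they come.
  ioPool.submit((Callable<Void>) () -> {
    for (String paragraph : readParagraphs(inputFile)) {  // readParagraphs() is made up
      paragraphQueue.put(paragraph);
    }
    return null;
  });

  // Stage 2, N threads: analyze paragraphs into sentences, pass them on.
  for (int i = 0; i < analyzeThreads; i++) {
    workerPool.submit((Callable<Void>) () -> {
      while (true) {
        String paragraph = paragraphQueue.take();
        for (AnalyzedSentence s : analyzeParagraph(paragraph)) {  // stand-in for LT's analysis
          sentenceQueue.put(s);
        }
      }
    });
  }

  // Stage 3, N threads: drain sentenceQueue and run the rules, same pattern.
  // Clean shutdown (poison pills or counters) omitted to keep the sketch short.

With this shape the *Wait times should mostly disappear, because a thread only waits when a queue is momentarily empty or full, not until the slowest thread of a batch is done.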