Hi,

On 2015-02-21 at 19:22, Andriy Rysin wrote:
> So the main problem with this performance improvement is that we read
> across paragraphs. There are two problems with this:
> 1) error context shows sentences from another paragraph:
> I almost worked out a solution for that by adjusting ContextTools but
> then I found the next one:
> 2) the cross-sentence rules start to work across paragraphs
>
> and when I was analyzing the code I found that if we read from the
> file and it's smaller than 64k we don't parse it by paragraphs. So the
> cross-sentence rules work across paragraphs here too.

Let me explain why it works like this. We had a 64K limit in the past,
and I needed to check larger files, so for larger files I devised some
rough input-buffering code. But this is a dirty solution.

I think we should get a proper solution with an Iterator over a file
buffer. The iterator class should be sensitive to the paragraph setting
(\n or \n\n). I guess we could simply send an extra special token to the
tokenizer, or something like that, so that we get the PARA_END tag
whenever we reach a paragraph boundary.

Do I understand correctly that performance suffers when we wait until we
have read a whole paragraph? Right now the SRX segmentation handles the
line ends as well, so we would need to look at the connection to the
sentence splitter.

> This can be observed in MainTest.testEnglishFile() which gives 3
> matches vs MainTest.testEnglishStdIn4() which reads the same text but
> using stdin gives 4.

And it should give 3, right? Paragraph-level rules are doing something
useful.

> If we are to fix the "small file" case by splitting paragraph would it
> make sense to remove special handling for small files? If it's small
> it would be checked fast anyway and removing extra if/else blocks
> would clean up the code logic...

I think we should seal the logic in an iterator, so that it works the
same way in all cases.
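
To make the iterator idea a bit more concrete, here is a minimal sketch of what it might look like. Everything in it is an assumption for illustration: the class name ParagraphIterator and the boolean flag are hypothetical and not existing LanguageTool code, and the sketch only returns paragraph text. Emitting PARA_END and hooking into the SRX sentence splitter would still be up to the caller. Note also that readLine() drops the original line terminators, so real code would have to take care of character offsets.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: iterates over the paragraphs of a character stream.
 * Not part of LanguageTool; all names are illustrative only.
 */
class ParagraphIterator implements Iterator<String> {

  private final BufferedReader reader;
  // true: a single line break ends a paragraph (\n);
  // false: only an empty line ends a paragraph (\n\n)
  private final boolean singleLineBreakMarksParagraph;
  // the paragraph that the next call to next() will return, or null at end
  private String next;

  ParagraphIterator(Reader source, boolean singleLineBreakMarksParagraph) {
    this.reader = new BufferedReader(source);
    this.singleLineBreakMarksParagraph = singleLineBreakMarksParagraph;
    this.next = readParagraph();
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public String next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    String current = next;
    next = readParagraph();
    return current;
  }

  /** Returns the next non-empty paragraph, or null at end of input. */
  private String readParagraph() {
    try {
      StringBuilder paragraph = new StringBuilder();
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.trim().isEmpty()) {
          if (paragraph.length() > 0) {
            break;     // a blank line closes the current paragraph
          }
          continue;    // skip blank lines between paragraphs
        }
        if (paragraph.length() > 0) {
          paragraph.append('\n');
        }
        paragraph.append(line);
        if (singleLineBreakMarksParagraph) {
          break;       // every non-empty line is its own paragraph
        }
      }
      return paragraph.length() > 0 ? paragraph.toString() : null;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A caller could wrap a file reader or stdin in the same iterator and hand each paragraph to the checker separately, so the small-file, large-file and stdin cases would all take one code path, and cross-sentence rules would never see text from two paragraphs at once.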

Regards,
Marcin

>
> Thanks
> Andriy
>
> 2015-02-20 9:00 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
>> So before wrapping these optimizations up I decided to take a last
>> look at the thread graph in jvisualvm and it showed that the worker
>> threads spend more time in park state than running. But the graph
>> was really not showing why, it was more like a noodle soup. So I
>> brought one of my past optimizations back in: to always read the file
>> in big blocks (don't start analyze/check on each paragraph break), this
>> made the thread graph very clear: besides waiting for the main thread
>> to prepare sentences, the check threads' run times were not equal (we
>> had an equal number of rules per thread, which does not actually
>> amount to the same load). So I've added another of my test
>> optimizations which didn't help before: creating a callable for each
>> rule rather than for a group of rules.
>> The result: my cpu idle state went from 40% to 10% (now pretty much
>> all of those 10% is in the main thread, we could optimize it too but
>> will have to refactor our workflow a bit). The speed went up from ~2500
>> (~1900 originally before previous optimizations) to ~2700 sentences/s.
>> With this change adding more threads than cpus doesn't help (actually
>> decreases performance) so we could probably get rid of the new
>> internal property.
>>
>> Just to note: there's a slight change in output: as we don't split the
>> check on each paragraph change, in the output some sentences with
>> errors will have the beginning of the next sentence (beyond the
>> paragraph break). Hopefully it's not a big deal.
>>
>> I will need to work on cleaning things up, add changes for
>> SameRuleGroupFilter and then will create another branch for everybody
>> to test it out.
>>
>> Andriy
>>
>> 2015-02-20 8:10 GMT-05:00 Daniel Naber <daniel.na...@languagetool.org>:
>>> On 2015-02-19 22:16, Andriy Rysin wrote:
>>>
>>>> I've merged multithreading branch into master. Please try it out when
>>>> you have a chance and let me know if you see any issues.
>>>
>>> Thanks. Some small cleanup ideas:
>>>
>>> -setThreadPoolSize should probably be a parameter of the constructor, as
>>> calling it after thread pool setup would fail anyway ("Thread pool
>>> already initialized")
>>> -Does newFixedThreadPool need to use lazy init? If it gets initialized
>>> in the constructor, it can also be made final.
>>> -It can be 'threadPool' I think, no need for the 'new' and 'fixed' in
>>> the variable name.
>>>
>>> Regards
>>> Daniel
>>>
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> Languagetool-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel