On 2015-02-22 at 15:24, Andriy Rysin wrote:
> On 02/22/2015 04:45 AM, Marcin Miłkowski wrote:
>> Hi,
>>
>> On 2015-02-21 at 19:22, Andriy Rysin wrote:
>>> So the main problem with this performance improvement is that we read
>>> across paragraphs. There are two problems with this:
>>> 1) The error context shows sentences from another paragraph. I almost
>>> worked out a solution for that by adjusting ContextTools, but then I
>>> found the next one:
>>> 2) The cross-sentence rules start to work across paragraphs.
>>>
>>> While analyzing the code I also found that if we read from a file and
>>> it's smaller than 64k, we don't parse it by paragraphs, so the
>>> cross-sentence rules work across paragraphs here too.
>> Let me explain why it works like this. We had a 64K limit in the past,
>> and I needed to check larger files, so for larger files I devised some
>> rough input-buffer code.
>>
>> But this is a dirty solution. I think we should build a proper solution
>> with an Iterator over a file buffer. The iterator class should be
>> sensitive to the paragraph setting (\n or \n\n). I guess we could simply
>> send an extra special token to the tokenizer, or something like that, so
>> that we get the PARA_END tag whenever we reach a paragraph boundary.
>> Do I understand correctly that performance is crippled when we wait
>> until we find the whole paragraph?
> Not quite. The problem is that we fill the analyze/check worker threads
> with small chunks of data. I did some stats yesterday and realized that
> the 4 biggest text files I had to run regressions on had ~40% of
> paragraphs with 1-3 sentences. Those are printed media archives, and I
> guess the one-sentence case covers chapter titles, newspaper titles,
> author, date, etc. When this happens (and you have e.g.
> 4 CPUs), some of the "analyze sentence" worker threads stay idle, and if
> we invoke "check" threads on only a couple of sentences, splitting into
> threads may produce more overhead than benefit (not sure about this: if
> you have a very big number of rules it may still be faster).
> When I removed the "checkpoint" at the paragraph level and always sent
> 64k blocks to the worker threads (ignoring some regressions), my CPU
> idle state went from 40% to 10% (and those 10% are because the worker
> threads wait for the main thread to read the file and tokenize - we
> could theoretically optimize that one too).
> I actually have one book which has much longer paragraphs, and when I
> test with it the CPUs are much less idle.
>
>> Right now the SRX segmentation handles the line ends as well, so we
>> would need to look at the connection to the sentence splitter.
>>
>>> This can be observed in MainTest.testEnglishFile(), which gives 3
>>> matches, vs. MainTest.testEnglishStdIn4(), which reads the same text
>>> via stdin and gives 4.
>> And it should give 3, right? Paragraph-level rules are doing something
>> useful.
> It depends: it should give 3 if paragraph-level rules should not work
> across paragraph boundaries, and 4 if they should.
>
>>> If we are to fix the "small file" case by splitting on paragraphs,
>>> would it make sense to remove the special handling for small files?
>>> If a file is small it would be checked fast anyway, and removing the
>>> extra if/else blocks would clean up the code logic...
>> I think we should seal the logic in an iterator, and it would work the
>> same way for all cases.
> So to move on with more optimizations (by sending bigger blocks to
> worker threads) we need several things:
> 1) agree whether paragraph-level rules should work across paragraphs;
They should not; at least, they were designed to work within a single
paragraph only. That was my idea back then. We could have rules that work
at the whole-file level if we want checks across paragraphs.

> if yes, there's not much extra work; if no, then we have to make sure
> paragraph boundaries are set by the sentence tokenizer rather than the
> file reader, and add logic to the paragraph-level rules to stop at the
> paragraph boundary. It seems the sentence tokenizer already adds a
> newline at the end of the last sentence in a paragraph, but I gave up
> before fully understanding how it's set and used.

Actually, as far as I remember, the tokenizer does not add any code for
paragraphs. It just splits a sentence on one or on two newlines, depending
on what you set on the command line (using -b). I believe the paragraph
code is added in JLanguageTool.getAnalyzedSentence(); it just needs to
know how many newlines make a paragraph. But I haven't read the current
code and simply remember what I wanted to code. Maybe I did some dirty
hack...

> IMHO splitting text into paragraphs should not be in the commandline/file
> reader but in the core logic (e.g. the sentence tokenizer)

It's not in the reader right now, IMHO.

> If we agree on this we can also merge the code for small/large files to
> be the same.

Yes, I guess we should buffer the lines to get a big chunk for checking.
Actually, I need to design a similar solution for the server I'm calling
via HTTP from a Python script: I have thousands of small chunks that I
need to glue together for performance reasons.

> 2) agree whether match context information can contain sentences from
> other paragraphs; if we don't do anything, it will. If we want to remove
> other-paragraph sentences from the context, I have some relatively small
> code that does it for plain-text output, but I need more code to make it
> work for the API output.

I think we should display the same paragraph.
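To make the iterator idea a bit more concrete, here is a minimal sketch (my own illustration with hypothetical names, not existing LanguageTool code) of an Iterator that yields one paragraph at a time from a reader, treating an empty line (i.e. "\n\n") as the paragraph boundary. The same loop could then serve both the small-file and large-file paths:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;

// Sketch: iterates over paragraphs of a text, where a paragraph ends at
// an empty line ("\n\n"). Class and method names are hypothetical.
class ParagraphIterator implements Iterator<String> {
  private final BufferedReader reader;
  private String next;

  ParagraphIterator(BufferedReader reader) {
    this.reader = reader;
    this.next = readParagraph();
  }

  private String readParagraph() {
    StringBuilder sb = new StringBuilder();
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.isEmpty()) {
          if (sb.length() > 0) {
            break;  // paragraph boundary reached
          }
          continue;  // skip extra blank lines between paragraphs
        }
        sb.append(line).append('\n');
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.length() > 0 ? sb.toString() : null;
  }

  @Override public boolean hasNext() { return next != null; }

  @Override public String next() {
    String result = next;
    next = readParagraph();  // read ahead so hasNext() stays accurate
    return result;
  }
}
```

Each chunk this yields is a complete paragraph, so a PARA_END marker could be attached when it is handed to the tokenizer, and paragraph-level rules would naturally stop at the boundary.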
Basically, it would be much more helpful to get the whole sentence in the
API output as well.

Marcin

> Andriy
>
>> Regards,
>> Marcin
>>
>>> Thanks
>>> Andriy
>>>
>>> 2015-02-20 9:00 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
>>>> So before wrapping these optimizations up I decided to take a last
>>>> look at the thread graph in jvisualvm, and it showed that the worker
>>>> threads spend more time in the parked state than running. But the
>>>> graph really didn't show why; it looked more like noodle soup. So I
>>>> brought one of my past optimizations back in: always read the file in
>>>> big blocks (don't start analyze/check on each paragraph break). This
>>>> made the thread graph very clear: besides waiting for the main thread
>>>> to prepare sentences, the check threads' run times were not equal (we
>>>> had an equal number of rules per thread, which does not actually
>>>> amount to the same load). So I've added another of my test
>>>> optimizations which didn't help before: creating a callable for each
>>>> rule rather than for a group of rules.
>>>> The result: my CPU idle state went from 40% to 10% (now pretty much
>>>> all of those 10% are in the main thread; we could optimize that too
>>>> but would have to refactor our workflow a bit). The speed went up
>>>> from ~2500 (~1900 originally, before the previous optimizations) to
>>>> ~2700 sentences/s.
>>>> With this change, adding more threads than CPUs doesn't help (it
>>>> actually decreases performance), so we could probably get rid of the
>>>> new internal property.
>>>>
>>>> Just to note: there's a slight change in output. As we don't split
>>>> the check on each paragraph change, some sentences with errors will
>>>> show the beginning of the next sentence (beyond the paragraph break)
>>>> in the output. Hopefully it's not a big deal.
>>>>
>>>> I will need to work on cleaning things up and add changes for
>>>> SameRuleGroupFilter, and then I will create another branch for
>>>> everybody to test.
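For what it's worth, the "callable per rule" idea described above can be sketched roughly like this (hypothetical names, not the actual LanguageTool implementation): each rule check becomes its own Callable, so the thread pool can balance rules with very different costs instead of fixing an equal number of rules per thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: submit one Callable per rule instead of one per group of
// rules, letting the executor even out uneven per-rule workloads.
class PerRuleCheck {
  static List<String> checkAll(List<Callable<String>> ruleChecks, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (Callable<String> check : ruleChecks) {
        futures.add(pool.submit(check));  // one task per rule
      }
      List<String> matches = new ArrayList<>();
      for (Future<String> f : futures) {
        try {
          String match = f.get();  // collecting in order keeps results deterministic
          if (match != null) {
            matches.add(match);
          }
        } catch (InterruptedException | ExecutionException e) {
          throw new RuntimeException(e);
        }
      }
      return matches;
    } finally {
      pool.shutdown();
    }
  }
}
```

Because every rule is an independent task, a few expensive rules no longer leave other threads idle, which matches the observed drop in CPU idle time.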
>>>>
>>>> Andriy
>>>>
>>>> 2015-02-20 8:10 GMT-05:00 Daniel Naber <daniel.na...@languagetool.org>:
>>>>> On 2015-02-19 22:16, Andriy Rysin wrote:
>>>>>
>>>>>> I've merged the multithreading branch into master. Please try it
>>>>>> out when you have a chance and let me know if you see any issues.
>>>>> Thanks. Some small cleanup ideas:
>>>>>
>>>>> -setThreadPoolSize should probably be a parameter of the constructor,
>>>>> as calling it after thread pool setup would fail anyway ("Thread pool
>>>>> already initialized")
>>>>> -Does newFixedThreadPool need to use lazy init? If it gets initialized
>>>>> in the constructor, it can also be made final.
>>>>> -It can be 'threadPool', I think; no need for 'new' and 'fixed' in
>>>>> the variable name.
>>>>>
>>>>> Regards
>>>>> Daniel
>>>>>
>>>>> _______________________________________________
>>>>> Languagetool-devel mailing list
>>>>> Languagetool-devel@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
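Daniel's cleanup suggestions would amount to something like the following sketch (a hypothetical class, not the actual code): with the pool size passed through the constructor, the pool can be created eagerly, the "already initialized" check disappears, and the field can be final.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: taking the pool size as a constructor parameter removes the
// need for lazy init and the "Thread pool already initialized" check,
// and lets the field be final.
class MultiThreadedChecker {
  private final ExecutorService threadPool;  // 'threadPool', as suggested

  MultiThreadedChecker(int threadPoolSize) {
    if (threadPoolSize < 1) {
      throw new IllegalArgumentException("threadPoolSize must be >= 1");
    }
    this.threadPool = Executors.newFixedThreadPool(threadPoolSize);
  }

  ExecutorService getThreadPool() {
    return threadPool;
  }

  void shutdown() {
    threadPool.shutdown();
  }
}
```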