On 2015-02-22 at 15:24, Andriy Rysin wrote:
> On 02/22/2015 04:45 AM, Marcin Miłkowski wrote:
>> Hi,
>>
>>
>> On 2015-02-21 at 19:22, Andriy Rysin wrote:
>>> So the main problem with this performance improvement is that we read
>>> across paragraphs. There are two problems with this:
>>> 1) the error context shows sentences from another paragraph.
>>> I almost worked out a solution for that by adjusting ContextTools, but
>>> then I found the next one:
>>> 2) the cross-sentence rules start to work across paragraphs.
>>>
>>> And when I was analyzing the code I found that if we read from a file
>>> and it's smaller than 64k, we don't parse it by paragraphs. So the
>>> cross-sentence rules work across paragraphs there too.
>> Let me explain why it works like this. We had a 64K limit in the past,
>> and I needed to check larger files, so for larger files I wrote some
>> rough input-buffering code.
>>
>> But this is a dirty solution. I think we should get a proper solution
>> with an iterator over a file buffer. The iterator class should be
>> sensitive to the paragraph setting (\n or \n\n). I guess we could
>> simply send an extra special token to the tokenizer, or something like
>> that, so that we get the PARA_END tag whenever we reach a paragraph
>> boundary.
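>>
>> Roughly what I have in mind (just a rough sketch, the class name is
>> made up; the caller would emit the PARA_END token after each
>> paragraph it gets back):
>>
>>   import java.io.BufferedReader;
>>   import java.io.IOException;
>>   import java.io.UncheckedIOException;
>>   import java.util.Iterator;
>>   import java.util.NoSuchElementException;
>>
>>   // yields one paragraph at a time; with singleLineBreaks (the -b
>>   // option) every non-empty line is its own paragraph, otherwise an
>>   // empty line ends one
>>   class ParagraphIterator implements Iterator<String> {
>>     private final BufferedReader reader;
>>     private final boolean singleLineBreaks;
>>     private String next;
>>
>>     ParagraphIterator(BufferedReader reader, boolean singleLineBreaks) {
>>       this.reader = reader;
>>       this.singleLineBreaks = singleLineBreaks;
>>       this.next = readParagraph();
>>     }
>>
>>     @Override public boolean hasNext() { return next != null; }
>>
>>     @Override public String next() {
>>       if (next == null) {
>>         throw new NoSuchElementException();
>>       }
>>       String current = next;
>>       next = readParagraph();
>>       return current;
>>     }
>>
>>     private String readParagraph() {
>>       try {
>>         StringBuilder sb = new StringBuilder();
>>         String line;
>>         while ((line = reader.readLine()) != null) {
>>           if (line.isEmpty()) {
>>             if (sb.length() > 0) {
>>               return sb.toString();  // paragraph boundary reached
>>             }
>>             continue;  // skip extra empty lines
>>           }
>>           if (singleLineBreaks) {
>>             return line + "\n";
>>           }
>>           sb.append(line).append('\n');
>>         }
>>         return sb.length() > 0 ? sb.toString() : null;  // last paragraph
>>       } catch (IOException e) {
>>         throw new UncheckedIOException(e);
>>       }
>>     }
>>   }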
>>
>> Do I understand correctly that performance is crippled because we wait
>> until we have the whole paragraph?
> Not quite, the problem is that we fill the analyze/check worker threads
> with small chunks of data. I did some stats yesterday and realized that
> the 4 biggest text files I run regressions on have ~40% of paragraphs
> with 1-3 sentences. Those are printed media archives, and I guess the
> 1-sentence case covers chapter titles, newspaper headlines, authors,
> dates, etc. When this happens (and you have e.g. 4 CPUs), some of the
> "analyze sentence" worker threads stay idle, and if we invoke "check"
> threads on only a couple of sentences, the splitting into threads may
> produce more overhead than benefit (not sure about this: if you have a
> very big number of rules it may still be faster).
> When I removed the "checkpoint" at the paragraph level and always sent
> 64k blocks to the worker threads (ignoring some regressions), my CPU
> idle time went from 40% to 10% (and those 10% are because the worker
> threads wait for the main thread to read the file and tokenize - we
> could theoretically optimize that too).
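>
> The idea is roughly this (just a sketch; CheckCallable is made up, the
> real code is messier):
>
>   // needs java.util.concurrent.ExecutorService
>   void dispatchInBlocks(Iterable<String> paragraphs, ExecutorService threadPool) {
>     StringBuilder block = new StringBuilder();
>     for (String paragraph : paragraphs) {
>       block.append(paragraph);
>       if (block.length() >= 64 * 1024) {  // only dispatch big blocks
>         threadPool.submit(new CheckCallable(block.toString()));
>         block.setLength(0);
>       }
>     }
>     if (block.length() > 0) {  // don't lose the last partial block
>       threadPool.submit(new CheckCallable(block.toString()));
>     }
>   }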
>
> I actually have one book with much longer paragraphs, and when I test
> it the CPUs are much less idle.
>
>> Right now the SRX segmentation handles line ends as well, so we would
>> need to look at how it connects to the sentence splitter.
>>
>>> This can be observed in MainTest.testEnglishFile(), which gives 3
>>> matches, vs. MainTest.testEnglishStdIn4(), which reads the same text
>>> via stdin and gives 4.
>> And it should give 3, right? Paragraph-level rules are doing something
>> useful.
> It depends: it should give 3 if paragraph-level rules should not work
> across paragraph boundaries, and 4 if they should.
>
>>> If we are to fix the "small file" case by splitting on paragraphs,
>>> would it make sense to remove the special handling for small files? If
>>> a file is small it will be checked fast anyway, and removing the extra
>>> if/else blocks would clean up the code logic...
>> I think we should encapsulate the logic in an iterator, and then it
>> would work the same way for all cases.
> So to move on with more optimizations (by sending bigger blocks to
> worker threads) we need several things:
> 1) agree on whether paragraph-level rules should work across paragraphs,

They should not; at least they were designed to work within a single 
paragraph only. That was my idea back then.

We could have rules that work at the whole-file level if we want checks 
across paragraphs.

> If yes,
> there's not much extra work; if no, then we have to make sure paragraph
> boundaries are set by the sentence tokenizer rather than the file
> reader, and add logic to the paragraph-level rules to stop at paragraph
> boundaries. It seems the sentence tokenizer already adds a newline at
> the end of the last sentence in a paragraph, but I gave up before fully
> understanding how it's set and used.

Actually, as far as I remember, the tokenizer does not add any code for 
paragraphs. It just splits sentences on one or two newlines, depending 
on what you set on the command line (using -b).

I believe the paragraph code is added in 
JLanguageTool.getAnalyzedSentence(). It just needs to know how many 
newlines make a paragraph. But I haven't read the current code; I 
simply remember what I wanted to implement. Maybe I did some dirty hack...
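
From memory, it needs to do something like this (just a sketch, not the 
actual code; setParagraphEnd() stands for whatever the real setter on 
AnalyzedTokenReadings is called):

  // AnalyzedTokenReadings is our class; setParagraphEnd() stands for
  // whatever the real setter is called
  void markParagraphEnd(String sentence, AnalyzedTokenReadings[] tokens,
      boolean singleLineBreaks) {
    String paraBreak = singleLineBreaks ? "\n" : "\n\n";
    if (tokens.length > 0 && sentence.endsWith(paraBreak)) {
      tokens[tokens.length - 1].setParagraphEnd();
    }
  }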

> IMHO splitting text into paragraphs should not happen in the
> command-line/file reader but in the core logic (e.g. the sentence
> tokenizer)

It's not in the reader right now, as far as I know.

>
> If we agree on this, we can also merge the code paths for small and
> large files.

Yes, I guess we should buffer the lines to get a big chunk for checking. 
Actually, I need to design a similar solution for the server I'm calling 
via HTTP from a Python script: I have thousands of small chunks that I 
need to glue together for performance reasons.
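
The gluing part could look something like this (just a sketch; the 
method name is made up):

  // needs java.util.List; keeps each chunk's start offset so results
  // can be mapped back to the original chunks afterwards
  String glue(List<String> chunks, List<Integer> offsets) {
    StringBuilder glued = new StringBuilder();
    for (String chunk : chunks) {
      offsets.add(glued.length());
      glued.append(chunk).append("\n\n");  // paragraph break between chunks
    }
    return glued.toString();
  }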

>
> 2) agree on whether the match context information can contain
> sentences from other paragraphs. If we don't do anything, it will. If
> we want to remove other paragraphs' sentences from the context, I have
> some relatively small code that does it for plain-text output, but I
> need more code to make it work for API output.

I think we should display only the same paragraph. Basically, it would 
also be much more helpful to get the whole sentence in the API output.
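
For the plain-text side, something along these lines should do (just a 
sketch; the method name is made up, and I assume "\n\n" marks a 
paragraph break):

  // cut the context at the nearest paragraph breaks around the error
  String trimToParagraph(String context, int errorStart, int errorEnd) {
    int start = context.lastIndexOf("\n\n", errorStart);
    int end = context.indexOf("\n\n", errorEnd);
    return context.substring(start < 0 ? 0 : start + 2,
        end < 0 ? context.length() : end);
  }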

Marcin

>
> Andriy
>
>> Regards,
>> Marcin
>>
>>> Thanks
>>> Andriy
>>>
>>> 2015-02-20 9:00 GMT-05:00 Andriy Rysin <ary...@gmail.com>:
>>>> So before wrapping these optimizations up, I decided to take a last
>>>> look at the thread graph in jvisualvm, and it showed that the worker
>>>> threads spend more time parked than running. But the graph wasn't
>>>> really showing why; it was more like noodle soup. So I brought one of
>>>> my past optimizations back in: always read the file in big blocks
>>>> (don't start analyze/check on each paragraph break). This made the
>>>> thread graph very clear: besides waiting for the main thread to
>>>> prepare sentences, the check threads' run times were not equal (we
>>>> had an equal number of rules per thread, which does not actually
>>>> amount to the same load). So I added another of my test optimizations
>>>> which didn't help before: creating a callable for each rule rather
>>>> than for each group of rules.
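>>>>
>>>> Simplified, the idea looks like this (just a sketch; Rule.match()
>>>> returns RuleMatch[] as in our code):
>>>>
>>>>   // Rule, RuleMatch, AnalyzedSentence are our classes; assumes
>>>>   // java.util.* and java.util.concurrent.* imports
>>>>   List<RuleMatch> checkParallel(List<Rule> rules, AnalyzedSentence sentence,
>>>>       ExecutorService threadPool) throws Exception {
>>>>     List<Future<RuleMatch[]>> futures = new ArrayList<>();
>>>>     for (Rule rule : rules) {
>>>>       // one callable per rule, so the pool can balance the load
>>>>       futures.add(threadPool.submit(() -> rule.match(sentence)));
>>>>     }
>>>>     List<RuleMatch> matches = new ArrayList<>();
>>>>     for (Future<RuleMatch[]> future : futures) {
>>>>       matches.addAll(Arrays.asList(future.get()));
>>>>     }
>>>>     return matches;
>>>>   }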
>>>>
>>>> The result: my CPU idle time went from 40% to 10% (now pretty much
>>>> all of those 10% are in the main thread; we could optimize that too,
>>>> but we would have to refactor our workflow a bit). The speed went up
>>>> from ~2500 sentences/s (~1900 originally, before the previous
>>>> optimizations) to ~2700 sentences/s. With this change, adding more
>>>> threads than CPUs doesn't help (it actually decreases performance),
>>>> so we could probably get rid of the new internal property.
>>>>
>>>> Just to note: there's a slight change in the output: as we don't
>>>> split the check on each paragraph break, some sentences with errors
>>>> will show the beginning of the next sentence (beyond the paragraph
>>>> break) in their context. Hopefully that's not a big deal.
>>>>
>>>> I will need to work on cleaning things up and add changes for
>>>> SameRuleGroupFilter, and then I will create another branch for
>>>> everybody to test.
>>>>
>>>> Andriy
>>>>
>>>> 2015-02-20 8:10 GMT-05:00 Daniel Naber <daniel.na...@languagetool.org>:
>>>>> On 2015-02-19 22:16, Andriy Rysin wrote:
>>>>>
>>>>>> I've merged multithreading branch into master. Please try it out when
>>>>>> you have a chance and let me know if you see any issues.
>>>>> Thanks. Some small cleanup ideas:
>>>>>
>>>>> -setThreadPoolSize should probably be a parameter of the constructor, as
>>>>> calling it after thread pool setup would fail anyway ("Thread pool
>>>>> already initialized")
>>>>> -Does newFixedThreadPool need to use lazy init? If it gets initialized
>>>>> in the constructor, it can also be made final.
>>>>> -It can be 'threadPool' I think, no need for the 'new' and 'fixed' in
>>>>> the variable name.
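>>>>>
>>>>> All three together would look roughly like this (just a sketch; the
>>>>> class name and constructor details are approximate):
>>>>>
>>>>>   // assumes java.util.concurrent.ExecutorService and Executors
>>>>>   private final ExecutorService threadPool;
>>>>>
>>>>>   public MultiThreadedJLanguageTool(Language language, int threadPoolSize) {
>>>>>     super(language);
>>>>>     this.threadPool = Executors.newFixedThreadPool(threadPoolSize);
>>>>>   }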
>>>>>
>>>>> Regards
>>>>>     Daniel