On 02/22/2015 01:18 PM, Marcin Miłkowski wrote:
> On 2015-02-22 at 15:24, Andriy Rysin wrote:
>> On 02/22/2015 04:45 AM, Marcin Miłkowski wrote:
>>> Hi,
>>>
>>>
>>> On 2015-02-21 at 19:22, Andriy Rysin wrote:
>>>> So the main problem with this performance improvement is that we read
>>>> across paragraphs. There are two problems with this:
>>>> 1) error context shows sentences from another paragraph:
>>>> I almost worked out a solution for that by adjusting ContextTools but
>>>> then I found the next one:
>>>> 2) the cross-sentence rules start to work across paragraphs
>>>>
>>>> and when I was analyzing the code I found that if we read from the
>>>> file and it's smaller than 64k we don't parse it by paragraphs. So the
>>>> cross-sentence rules work across paragraphs here too.
>>> Let me explain why it works like this. We had a 64K limit in the
>>> past, and I needed to check larger files, so for larger files I
>>> devised some rough input buffering code.
>>>
>>> But this is a dirty solution. I think we should get a proper solution
>>> with an Iterator over a file buffer. Now, the iterator class should be
>>> sensitive to the paragraph setting (\n or \n\n). I guess we could simply
>>> send an extra special token to the tokenizer or something like that so
>>> that we get the PARA_END tag whenever we get to the paragraph boundary.
>>> Do I understand correctly that performance is crippled when we wait
>>> until we find the whole paragraph?
>> Not quite, the problem is that we feed the analyze/check worker threads
>> with small chunks of data. I did some stats yesterday and realized that
>> the 4 biggest text files I run regressions on had ~40% of paragraphs
>> with 1-3 sentences. Those are printed-media archives, and I guess the
>> 1-sentence case comes from chapter titles, newspaper titles, authors,
>> dates, etc. When this happens (and you have, e.g., 4 CPUs) some of the
>> "analyze sentence" worker threads stay idle, and if we invoke "check"
>> threads on only a couple of sentences the splitting into threads may
>> produce more overhead than benefit (not sure about this - if you have
>> a very big number of rules it may still be faster).
>> When I removed the "checkpoint" at the paragraph level and always sent
>> 64k blocks to the worker threads (ignoring some regressions), my CPU
>> idle time went from 40% to 10% (and those 10% are because the worker
>> threads wait for the main thread to read the file and tokenize - we
>> could theoretically optimize that too).
>> I actually have one book which has much longer paragraphs, and when I
>> test it the CPUs are much less idle.
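
(To make the above more concrete, here is a rough sketch of the batching
I have in mind - class and method names are made up, this is not the
actual code:)

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  // Sketch: glue short paragraphs into ~64k chunks before handing them
  // to the check threads, so no worker sits idle on tiny inputs.
  class BatchingChecker {
    private static final int BATCH_CHARS = 64 * 1024;
    private final ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    List<Future<Integer>> submitAll(List<String> paragraphs) {
      List<Future<Integer>> results = new ArrayList<>();
      StringBuilder batch = new StringBuilder();
      for (String para : paragraphs) {
        batch.append(para).append("\n\n");
        if (batch.length() >= BATCH_CHARS) {
          final String chunk = batch.toString();
          results.add(pool.submit(() -> checkChunk(chunk)));
          batch.setLength(0);
        }
      }
      if (batch.length() > 0) {
        final String chunk = batch.toString();
        results.add(pool.submit(() -> checkChunk(chunk)));
      }
      return results;
    }

    private int checkChunk(String chunk) {
      return 0;  // placeholder: would run the rules and count matches
    }
  }
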
>>
>>> Right now the SRX segmentation handles the line ends as well, so we
>>> would need to look at the connection to the sentence splitter.
>>>
>>>> This can be observed in MainTest.testEnglishFile() which gives 3
>>>> matches vs MainTest.testEnglishStdIn4() which reads the same text but
>>>> using stdin gives 4.
>>> And it should give 3, right? Paragraph-level rules are doing something
>>> useful.
>> It depends: it should give 3 if paragraph-level rules should not work
>> across paragraph boundaries, and 4 if they should.
>>
>>>> If we are to fix the "small file" case by splitting by paragraphs,
>>>> would it make sense to remove the special handling for small files?
>>>> If it's small it would be checked fast anyway, and removing the
>>>> extra if/else blocks would clean up the code logic...
>>> I think we should seal the logic in an iterator, and it would work the
>>> same way for all cases.
>> So to move on with more optimizations (by sending bigger blocks to
>> worker threads) we need several things:
>> 1) agree whether paragraph-level rules should work across paragraphs,
> They should not, at least they were designed to work in a single 
> paragraph only. That was my idea back then.
>
> We could have rules that work at the whole-file level if we want them
> to work across paragraphs.
Ok, so I'll try to summarize:
1) paragraph-level rules should only work inside a paragraph (so the
short-file check currently does not follow this logic, but we can fix it
easily by removing the short-file code and using the generic logic for
all file sizes)
2) match context should only include sentences from the current
paragraph; currently this works (for the generic logic) because when we
read the file we split it by paragraphs
>
>> if yes,
>> there's not much extra work; if no, then we have to make sure paragraph
>> boundaries are set by the sentence tokenizer rather than the file
>> reader, and add logic to the paragraph-level rules to stop at the
>> paragraph boundary; it seems the sentence tokenizer already adds a
>> newline at the end of the last sentence in a paragraph, but I gave up
>> before fully understanding how it's set and used.
> Actually, as far as I remember, the tokenizer does not add any code for
> paragraphs. It just splits sentences on one or two newlines, depending
> on what you set on the command line (using -b).
>
> I believe the paragraph code is added in
> JLanguageTool.getAnalyzedSentence(). It just needs to know how many
> newlines make a paragraph. But I didn't read the current code; I simply
> remember what I wanted to code. Maybe I did some dirty hack...
Yes, getAnalyzedSentence() adds a paragraph-end mark to the last token of
the last sentence, but it relies on the file reader to stop at the
paragraph break.
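
If the tokenizer (or whatever ends up splitting paragraphs) told us where
the boundary is, the marking itself could stay pretty much as it is -
something like this (sketch only, assuming AnalyzedSentence and
AnalyzedTokenReadings roughly as they are; setParaEnd() is just a
placeholder, I did not check the real method/tag name):

  import java.util.List;
  import org.languagetool.AnalyzedSentence;
  import org.languagetool.AnalyzedTokenReadings;

  // Sketch: once we know where a paragraph ends, mark the last token of
  // its last sentence, independently of how the text was read.
  void markParagraphEnd(List<AnalyzedSentence> paragraphSentences) {
    if (paragraphSentences.isEmpty()) {
      return;
    }
    AnalyzedSentence last = paragraphSentences.get(paragraphSentences.size() - 1);
    AnalyzedTokenReadings[] tokens = last.getTokens();
    if (tokens.length > 0) {
      tokens[tokens.length - 1].setParaEnd();  // placeholder for the real flag/tag
    }
  }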

>
>> IMHO splitting text into paragraphs should not be done in the
>> command-line/file reader but in the core logic (e.g. the sentence
>> tokenizer).
> It's not in the reader, IMHO, right now.
org.languagetool.commandline.Main reads the file by paragraphs; that's
why getAnalyzedSentence() and check() don't have to care about paragraph
boundaries.
So if we don't split by paragraphs in the reader, we need to add
paragraph boundaries elsewhere; I think the sentence tokenizer is the
best place for it, but I am not an expert here.
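
The splitting itself would be trivial to do in the core, roughly like
this (made-up helper, just to show what I mean by honoring the -b
setting from the command line):

  import java.util.ArrayList;
  import java.util.List;

  // Sketch: split raw text into paragraphs in the core instead of the
  // command-line reader; a blank line ends a paragraph by default,
  // a single newline if -b was given.
  static List<String> splitParagraphs(String text, boolean singleLineBreaksMarkPara) {
    String regex = singleLineBreaksMarkPara ? "\n" : "\n\\s*\n";
    List<String> paragraphs = new ArrayList<>();
    for (String para : text.split(regex)) {
      if (!para.trim().isEmpty()) {
        paragraphs.add(para);
      }
    }
    return paragraphs;
  }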
>> If we agree on this we can also merge the code for small/large file to
>> be the same.
> Yes, I guess we should buffer the lines to get a big chunk for checking. 
> Actually, I need to design a similar solution for the server I'm calling 
> via HTTP in a Python script: I have thousands of small chunks that I 
> need to glue together for performance reasons.
Yes, but we need to introduce paragraph boundaries in the tokenizer (or
in the sentence analyzer) before we can do that.
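
The gluing part itself could be as simple as something like this (just a
sketch, not tied to the real API - the point is to keep the start offsets
so matches can be mapped back to the original chunks):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  // Sketch: glue many small chunks into one big text to check, and keep
  // each chunk's start offset so a match position can be mapped back.
  class GluedChunks {
    private final StringBuilder text = new StringBuilder();
    private final List<Integer> starts = new ArrayList<>();

    void add(String chunk) {
      starts.add(text.length());
      text.append(chunk).append("\n\n");  // keep a paragraph break between chunks
    }

    String getText() {
      return text.toString();
    }

    // index of the original chunk that contains the given match offset
    int chunkIndexOf(int offset) {
      int idx = Collections.binarySearch(starts, offset);
      return idx >= 0 ? idx : -idx - 2;
    }
  }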
>
>> 2) agree whether match context information can contain sentences from
>> other paragraphs; if we don't do anything it will. If we want to remove
>> other paragraphs' sentences from the context, I have some relatively
>> small code that does it for plain-text output, but I need more code to
>> make it work for API output
> I think we should display the same paragraph. Basically, it would be 
> much more helpful to get a whole sentence in the API output as well.
I like the idea of showing the whole sentence in the context (and then we
would not actually care about other paragraphs).
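
Either way, the clipping for the plain-text output is basically this
(just a sketch of the idea, not the actual code): cut the context window
at the nearest paragraph break around the match.

  // Sketch: cut the context window at the nearest paragraph break so the
  // context never shows sentences from another paragraph.
  static String paragraphContext(String text, int matchStart, int matchEnd) {
    int from = text.lastIndexOf("\n\n", matchStart);
    from = (from == -1) ? 0 : from + 2;
    int to = text.indexOf("\n\n", matchEnd);
    to = (to == -1) ? text.length() : to;
    return text.substring(from, to);
  }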

Andriy

