I will have to look into the Brill tagger. I had never heard of it before.

About paragraphs and context: I was always taught that when changing the 
subject, you need a new paragraph.
Might be practical to generally ignore everything that is not in the 
sentence.

The data I have to process is paragraph-based, however, since dividing
paragraphs into sentences is still far from reliable. I am working on
improvement options there too.

Ruud

On 17-05-12 11:23, Marcin Miłkowski wrote:
> Ruud,
>
> On 2012-05-17 10:33, Ruud Baars wrote:
>> When there is a word confusion, we need to determine whether the other
>> words in the sentence (better: paragraph) significantly indicate the
>> other meaning.
>>
>> That will need a LT XML rule for every potential confusion, or one
>> special Java rule that takes care of it.
>>
>> So word1,word2,contextwords1,contextwords2 is all you need to process
>> it, imho. There is no classification really necessary. It is more
>> fundamental, but also more difficult.
> Not really, the Brill-tagger rules deal with it correctly on the
> sentence level. The paragraph-level dependencies are negligible. You're
> actually lumping together two different things that I mentioned:
>
> - classification (useless for the problem you mention),
> - automatic rule learning or language model learning (a solution for you).
>
> Look at After the Deadline docs, they have some explanation of
> statistical modeling.
>
> Marcin
>
>> These kinds of entries might be easy to generate. Actually, I am
>> collecting word-word combinations right now, and will be able to
>> determine if the relations are significant. It takes a lot of computer
>> time, but there is time.
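
A minimal sketch of how such word1,word2,contextwords1,contextwords2
entries could be generated from sentence-level co-occurrence counts. The
thread does not say how significance is measured; pointwise mutual
information (PMI) is used here only as a stand-in association measure, and
the function names, thresholds and whitespace tokenization are all
assumptions made for illustration.

```python
import math
from collections import Counter

def context_words(sentences, target, min_pmi=1.0, min_count=2):
    """Return {word: pmi} for words that co-occur with `target` more than chance.

    PMI is an assumed stand-in for "significant relation"; it is not a
    statistical test. Tokenization is plain lowercase whitespace splitting.
    """
    word_freq = Counter()   # number of sentences containing each word
    pair_freq = Counter()   # number of sentences containing target AND the word
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        word_freq.update(tokens)
        if target in tokens:
            pair_freq.update(tokens - {target})
    total = len(sentences)
    selected = {}
    for word, joint in pair_freq.items():
        if joint < min_count:
            continue  # too rare to trust
        # PMI = log2( P(target, word) / (P(target) * P(word)) )
        pmi = math.log2(joint * total / (word_freq[target] * word_freq[word]))
        if pmi >= min_pmi:
            selected[word] = pmi
    return selected

def likely_confused(sentence, used, alternative, ctx_used, ctx_alt):
    """True if the context supports `alternative` more strongly than `used`."""
    tokens = set(sentence.lower().split()) - {used, alternative}
    score_used = sum(ctx_used.get(t, 0.0) for t in tokens)
    score_alt = sum(ctx_alt.get(t, 0.0) for t in tokens)
    return score_alt > score_used
```

The two context lists are built once per confusion pair, for example with
context_words(corpus, "weather") and context_words(corpus, "whether"), and
then passed to likely_confused() for each sentence that contains one of
the two words. Counting per paragraph instead of per sentence would only
change what is passed in as `sentences`.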
>>
>> Ruud
>>
>> On 16-05-12 22:21, Marcin Miłkowski wrote:
>>> On 2012-05-16 20:10, Jan Schreiber wrote:
>>>
>>>> BTW, it should be possible to store at least those entities outside the
>>>> file itself, but I don't know how. --Jan
>>> Well, I had a look and it seems that you are using some of the entities
>>> to define fairly long regular expressions (disjunctions). This slows
>>> down LT quite substantially (I profiled some rules in the Polish XML
>>> file). I had such long lists for Polish reflexive verbs, and I decided
>>> to add a new POS tag for that, and it made processing much faster.
>>>
>>> But my solution was a hack that can be made more general. We do not need
>>> to include such new classifications in the normal tagger file: as our
>>> taggers can be used instead of all such disjunctive regular expressions,
>>> you could also simply include lists of adjectives referring to languages
>>> (sprachadj) in a dedicated semantic tagger file. This might be read by a
>>> manual tagger or a morfologik-stemming tagger (which will definitely
>>> work faster). We could, in principle, add a new attribute - a "semantic
>>> classification tag" - to XML that would be differentiated from a normal
>>> POS tag, and use our existing tagger infrastructure to support this new
>>> feature.
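
A rough sketch of the idea in the paragraph above: keep the word list in a
separate lexicon and look tokens up by hash, rather than matching every
token against one long disjunctive regular expression. This is not
LanguageTool's tagger code or file format; the tab-separated layout, the
three sample adjectives and all function names are assumptions chosen for
illustration. Only the tag name sprachadj comes from the thread.

```python
import re

# Hypothetical lexicon content (word<TAB>semantic tag); a real file would
# be much longer and would live outside the rule file.
SEMANTIC_LEXICON_TEXT = """\
deutsche\tsprachadj
englische\tsprachadj
polnische\tsprachadj
"""

def load_semantic_lexicon(text):
    """Parse 'word<TAB>tag' lines into {word: {tags}}."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, tag = line.split("\t")
        lexicon.setdefault(word.lower(), set()).add(tag)
    return lexicon

def semantic_tags(token, lexicon):
    """Return the semantic tags of a token (empty set if unknown)."""
    return lexicon.get(token.lower(), set())

# For comparison, the disjunction style that the lookup would replace.
DISJUNCTION = re.compile(r"(?:deutsche|englische|polnische)", re.IGNORECASE)

lexicon = load_semantic_lexicon(SEMANTIC_LEXICON_TEXT)
```

A rule would then test for "sprachadj" in semantic_tags(token, lexicon)
where it previously tested DISJUNCTION.fullmatch(token). The dictionary
lookup takes roughly the same time however long the list gets, while the
cost of the disjunction tends to grow with the number of alternatives.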
>>>
>>> I planned to use some parts of the Polish Wordnet for some rules, and
>>> it was only recently made available under a BSD-like license.
>>> Classifying some of the words semantically might be really useful for
>>> some rules.
>>>
>>> Regards
>>> Marcin
>>>
>>> ------------------------------------------------------------------------------
>>> Live Security Virtual Conference
>>> Exclusive live event will cover all the ways today's security and
>>> threat landscape has changed and how IT managers can respond. Discussions
>>> will include endpoint security, mobile security and the latest in malware
>>> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> Languagetool-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>


