Re: Chunker interface added

2013-08-27 Thread Daniel Naber
On 2013-08-24 16:28, R.J. Baars wrote:

> This is very promising. I would like to know more about this.
> Could it be added for Dutch, and is it controllable from the XML?

It's not, but we could add a new XML file that describes chunks. However,
chunks are based on part-of-speech tags, and these need to be
unambiguous. For example, in English, if you have the word "walk" and you
don't know whether it is a noun or a verb, you cannot assign a chunk
(noun phrase or verb phrase). Our part-of-speech information for English
is ambiguous because we cannot write disambiguation rules for
everything. Thus, for English, we're using an external component (OpenNLP)
to find chunks.

For other languages with fewer ambiguities, we might have more luck and
may be able to come up with rule-based chunking. So, anybody who wants a
chunker for their language: try to think about how to detect chunks
with simple rules. For English, a pattern like "article, any
number of adjectives, noun" isn't a bad start (but see above: in
English, finding the noun is not trivial). Alternatively, find an
existing chunker component we could add.
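
As a rough illustration of the "article, any number of adjectives, noun" pattern, here is a minimal Python sketch. It assumes the input is already unambiguously tagged, which is exactly the hard part for English. The function name and the use of the Penn Treebank tags DT, JJ and NN are illustrative choices, not part of LanguageTool.

```python
def find_noun_chunks(tagged):
    """Return (start, end) index pairs for article + adjective* + noun.

    `tagged` is a list of (word, tag) pairs with exactly one tag per word.
    `end` is exclusive, so tagged[start:end] is the chunk.
    """
    chunks = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "DT":                      # article
            j = i + 1
            while j < len(tagged) and tagged[j][1] == "JJ":
                j += 1                                # any number of adjectives
            if j < len(tagged) and tagged[j][1] == "NN":
                chunks.append((i, j + 1))             # noun closes the chunk
                i = j + 1
                continue
        i += 1
    return chunks


sentence = [("the", "DT"), ("clever", "JJ"), ("young", "JJ"),
            ("boy", "NN"), ("walks", "VBZ")]
chunks = find_noun_chunks(sentence)
```

If "walks" or "boy" carried more than one possible tag, the loop above would have nothing reliable to test against, which is why disambiguation has to come first.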

Regards
  Daniel

-- 
http://www.danielnaber.de

--
Introducing Performance Central, a new site from SourceForge and 
AppDynamics. Performance Central is your source for news, insights, 
analysis and resources for efficient Application Performance Management. 
Visit us today!
http://pubads.g.doubleclick.net/gampad/clk?id=48897511iu=/4140/ostg.clktrk
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel


Re: Chunker interface added

2013-08-27 Thread Xavi Ivars
2013/8/27 Daniel Naber list2...@danielnaber.de

> On 2013-08-24 16:28, R.J. Baars wrote:

>> This is very promising. I would like to know more about this.
>> Could it be added for Dutch, and is it controllable from the XML?

> It's not, but we could add a new XML file that describes chunks. However,
> chunks are based on part-of-speech tags, and these need to be
> unambiguous. For example, in English, if you have the word "walk" and you
> don't know whether it is a noun or a verb, you cannot assign a chunk
> (noun phrase or verb phrase). Our part-of-speech information for English
> is ambiguous because we cannot write disambiguation rules for
> everything. Thus, for English, we're using an external component (OpenNLP)
> to find chunks.


Have you thought about using Apertium resources to disambiguate English and
then adding a rule-based chunker for English?

-- 
 Xavi Ivars 
 http://xavi.ivars.me 


Re: Chunker interface added

2013-08-27 Thread Daniel Naber
On 2013-08-27 16:44, Xavi Ivars wrote:

> Have you thought about using Apertium resources to disambiguate English
> and then adding a rule-based chunker for English?

Actually, OpenNLP also disambiguates, as its first step is POS tagging.
I'd just need to find time to try a rule-based approach. What kind of
resources does Apertium provide for this? Is it different from running
the OpenNLP POS tagger?

Of course, feel free to give it a try if you want. Just let us know so 
we don't work on the same task without coordination.

Regards
  Daniel

-- 
http://www.danielnaber.de



Re: Chunker interface added

2013-08-27 Thread Xavi Ivars
2013/8/27 Daniel Naber list2...@danielnaber.de

> On 2013-08-27 16:44, Xavi Ivars wrote:

>> Have you thought about using Apertium resources to disambiguate English
>> and then adding a rule-based chunker for English?

> Actually, OpenNLP also disambiguates, as its first step is POS tagging.
> I'd just need to find time to try a rule-based approach. What kind of
> resources does Apertium provide for this? Is it different from running
> the OpenNLP POS tagger?


I don't think it's much different, but the POS tagger is
language-independent (you only need to train it, and it's already trained
for a lot of languages).

There's also some chunking done in Apertium [1][2], entirely rule-based
(defined in XML), so you might want to have a look.

[1] http://wiki.apertium.org/wiki/Chunking
[2] http://wiki.apertium.org/wiki/Chunking:_A_full_example
-- 
 Xavi Ivars 
 http://xavi.ivars.me 


RE: Chunker interface added

2013-08-27 Thread Mike Unwalla
I have some XML rules that I use for POS disambiguation in a term checker.
You can take what you want, or adapt the rules as necessary, from
www.simplified-english.co.uk/installation.html. Refer to the rulegroup with
id="POS_DISAMBIGUATION_IDENTIFY_NOUN".

(I am in the process of re-writing the rules to make them more robust and
more general. If you want to know when I update the rules, send me an
e-mail.)

Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm 

-----Original Message-----
From: Daniel Naber [mailto:list2...@danielnaber.de] 

For English, a pattern like "article, any
number of adjectives, noun" isn't a bad start (but see above: in
English, finding the noun is not trivial). Alternatively, find an
existing chunker component we could add.

Regards
  Daniel





Re: Chunker interface added

2013-08-24 Thread Daniel Naber
On 2013-08-24 16:28, R.J. Baars wrote:

> This is very promising. I would like to know more about this.

Nothing has been decided yet - it will take some time before I have a 
working version for English, then we'll see how this can be applied to 
other languages.

Regards
  Daniel

-- 
http://www.danielnaber.de



Re: Chunker interface added

2013-08-24 Thread Richard Eckart de Castilho
Are you going to build a chunker from scratch or rely on existing
technology, e.g. the OpenNLP Chunker [1]?

Cheers,

-- Richard

[1] 
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.chunker

On 24.08.2013 at 18:26, Daniel Naber list2...@danielnaber.de wrote:

> On 2013-08-24 16:28, R.J. Baars wrote:

>> This is very promising. I would like to know more about this.

> Nothing has been decided yet - it will take some time before I have a
> working version for English, then we'll see how this can be applied to
> other languages.
>
> Regards
>   Daniel




Re: Chunker interface added

2013-08-24 Thread Daniel Naber
On 2013-08-24 20:28, Richard Eckart de Castilho wrote:

> Are you going to build a chunker from scratch or rely on existing
> technology, e.g. the OpenNLP Chunker [1]?

I'll use the one from OpenNLP for now. It's kind of a black box for us, 
so I'm not sure yet how to handle those cases where OpenNLP gets it 
wrong. Any ideas about that?

Regards
  Daniel

-- 
http://www.danielnaber.de



Re: Chunker interface added

2013-08-24 Thread Richard Eckart de Castilho
On 24.08.2013 at 21:03, Daniel Naber list2...@danielnaber.de wrote:

> On 2013-08-24 20:28, Richard Eckart de Castilho wrote:

>> Are you going to build a chunker from scratch or rely on existing
>> technology, e.g. the OpenNLP Chunker [1]?

> I'll use the one from OpenNLP for now. It's kind of a black box for us,
> so I'm not sure yet how to handle those cases where OpenNLP gets it
> wrong. Any ideas about that?

I'm not familiar with its details, but given that it can be trained, a good
approach would probably be to start building a corpus of the sentences
it gets wrong and to retrain every once in a while on the original corpus plus
the manually corrected samples.
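
A minimal Python sketch of how such a retraining set could be assembled, under the assumption that both the original corpus and the corrections are lists of (tokens, chunk tags) pairs. The function name, the data layout and the example tags are hypothetical, and the actual OpenNLP training step is not shown.

```python
def build_training_set(original, corrections):
    """Combine the original corpus with manually corrected sentences.

    Both arguments are lists of (tokens, tags) pairs, where tokens and tags
    are tuples of equal length. If a sentence occurs in both lists, the
    manually corrected annotation wins.
    """
    merged = {tokens: tags for tokens, tags in original}
    merged.update({tokens: tags for tokens, tags in corrections})
    return list(merged.items())


original = [(("the", "boy"), ("B-NP", "I-NP"))]
corrections = [(("we", "walk"), ("B-NP", "B-VP"))]
training_set = build_training_set(original, corrections)
```

Keeping the corrections in a separate list makes it easy to regenerate the training set whenever new mistakes are collected.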

-- Richard


Chunker interface added

2013-08-21 Thread Daniel Naber
Hi,

I have added a Chunker interface that every language can implement. It
works like a tagger, but instead of assigning part-of-speech tags to
single words, it assigns chunk (phrase) tags. Typical chunks are noun chunks
and verb chunks. Typical noun chunks look like this in English:

a boy
the young boy
the clever young boy
the wonder boy

Why is this relevant? Because we currently have false alarms for
sentences like "There are over 500 college and university chapters." LT
only looks at "500 college", and the rule that matches a number followed
by a singular noun is triggered. Instead, the rule needs to match a
number followed by a singular noun chunk. With a properly working
chunker, that's possible.
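
To make the difference concrete, here is a small Python sketch that compares a token-level rule with a chunk-aware one on that sentence. The POS tags and the chunk annotation are written by hand for illustration (a real chunker might draw the chunk boundaries differently), and the function names are hypothetical, not LanguageTool code.

```python
tokens = ["There", "are", "over", "500", "college", "and", "university", "chapters"]
pos = ["EX", "VBP", "IN", "CD", "NN", "CC", "NN", "NNS"]

# Hand-annotated noun chunk as (start, end, number), end exclusive.
# Assumption: "college and university chapters" is one plural noun chunk.
chunks = [(4, 8, "plural")]


def token_rule(pos):
    """Positions where a number is directly followed by a singular noun."""
    return [i for i in range(len(pos) - 1)
            if pos[i] == "CD" and pos[i + 1] == "NN"]


def chunk_rule(pos, chunks):
    """Positions where a number is directly followed by a singular noun chunk."""
    singular_starts = {start for start, _end, number in chunks
                       if number == "singular"}
    return [i for i in range(len(pos) - 1)
            if pos[i] == "CD" and (i + 1) in singular_starts]


token_matches = token_rule(pos)          # fires on "500 college"
chunk_matches = chunk_rule(pos, chunks)  # does not fire
```

The token-level rule fires at the position of "500", while the chunk-aware rule stays silent because the chunk that follows the number is plural.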

What exists in git is just the interface, but I have an English chunker 
in a local branch that allows matching noun chunks like this:

<token chunk="B-NP-singular"/>
<token chunk="I-NP-singular" max="-1"/>

NP means noun phrase, B means beginning, and I means inside. So
B-NP-singular could match 'a' or 'the', while I-NP-singular with
max="-1" could match 'young boy'. In other words, although chunks are
larger entities that span several words, the chunk tags are assigned to
each of the words inside a chunk.
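
The per-word tagging can be sketched in a few lines of Python. This is only an illustration of the B/I scheme described above: the function name is hypothetical, and the "O" tag for words outside any chunk is an assumption borrowed from the usual BIO convention, not something the interface is known to use.

```python
def to_chunk_tags(token_count, chunks):
    """Spread chunk labels over single tokens using B- and I- prefixes.

    `chunks` is a list of (start, end, label) triples with `end` exclusive.
    Tokens outside every chunk get the tag "O" (an assumed convention).
    """
    tags = ["O"] * token_count
    for start, end, label in chunks:
        tags[start] = "B-" + label          # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining words of the chunk
    return tags


tokens = ["the", "young", "boy", "sleeps"]
tags = to_chunk_tags(len(tokens), [(0, 3, "NP-singular")])
```

Here "the" receives B-NP-singular and both "young" and "boy" receive I-NP-singular, which is why the second token pattern needs to match more than one word.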

I'll keep you updated about my progress with making the English chunker 
work properly. Let me know if you have any questions/comments.

Regards
  Daniel

-- 
http://www.danielnaber.de
