Re: Chunker interface added
On 2013-08-24 16:28, R.J. Baars wrote:

> This is very promising. I would like to know more about this. Could it be added for Dutch, and is it controllable from the xml?

It's not, but we could add a new XML file that describes chunks. However, chunks are based on part-of-speech tags, and these need to be unambiguous. For example, in English, if you have the word "walk" and you don't know whether it is a noun or a verb, you cannot assign a chunk (noun phrase or verb phrase). Our part-of-speech information for English is ambiguous, because we cannot write disambiguation rules for everything. Thus for English we're using an external component (OpenNLP) to find chunks. For other languages with fewer ambiguities, we might have more luck and may be able to build a rule-based chunker.

So anybody who wants a chunker for their language: try to think about how to detect chunks with simple rules. For English, a pattern like "article, any number of adjectives, noun" isn't a bad start (but see above: in English, finding the noun is not trivial). Alternatively, find an existing chunker component we could add.

Regards
 Daniel

-- 
http://www.danielnaber.de

--
Introducing Performance Central, a new site from SourceForge and AppDynamics. Performance Central is your source for news, insights, analysis and resources for efficient Application Performance Management. Visit us today!
http://pubads.g.doubleclick.net/gampad/clk?id=48897511iu=/4140/ostg.clktrk
___
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
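As an illustration of the "article, any number of adjectives, noun" idea, here is a minimal Python sketch of a rule-based noun chunker. It is not LanguageTool code. It assumes the input is already unambiguously tagged, uses simplified Penn-style tag names (DT, JJ, NN) chosen for the example, and marks the first token of a chunk with B-NP, the remaining tokens with I-NP, and everything else with O. The function name chunk_noun_phrases is hypothetical.

```python
from typing import List, Tuple

# Simplified tag names assumed for this sketch; a real tagset is richer.
ARTICLE, ADJECTIVE, NOUN = "DT", "JJ", "NN"


def chunk_noun_phrases(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return one chunk tag per token: B-NP, I-NP, or O (outside a chunk).

    Implements a single rule: article, any number of adjectives, noun.
    Each token must carry exactly one POS tag for the rule to work.
    """
    chunks = ["O"] * len(tagged)
    i = 0
    while i < len(tagged):
        if tagged[i][1] == ARTICLE:
            j = i + 1
            # Skip any number of adjectives after the article.
            while j < len(tagged) and tagged[j][1] == ADJECTIVE:
                j += 1
            # The rule only matches if a noun closes the sequence.
            if j < len(tagged) and tagged[j][1] == NOUN:
                chunks[i] = "B-NP"
                for k in range(i + 1, j + 1):
                    chunks[k] = "I-NP"
                i = j + 1
                continue
        i += 1
    return chunks


sentence = [("the", "DT"), ("clever", "JJ"), ("young", "JJ"),
            ("boy", "NN"), ("walks", "VBZ")]
for (word, _), chunk in zip(sentence, chunk_noun_phrases(sentence)):
    print(word, chunk)
```

The sketch shows why ambiguity is the real obstacle: if "walks" could be either NN or VBZ, the rule has no single tag to test, so disambiguation has to happen before chunking.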
Re: Chunker interface added
2013/8/27 Daniel Naber list2...@danielnaber.de

> On 2013-08-24 16:28, R.J. Baars wrote:
>
>> This is very promising. I would like to know more about this. Could it be added for Dutch, and is it controllable from the xml?
>
> It's not, but we could add a new XML that describes chunks. However, chunks are based on part-of-speech tags, and these need to be unambiguous. For example for English, if you have a word walk and you don't know whether it is a noun or a verb, you cannot assign a chunk (noun phrase or verb phrase). Our part-of-speech information for English is ambiguous, because we cannot write disambiguation rules for everything. Thus for English we're using an external component (OpenNLP) to find chunks.

Have you thought about using Apertium resources to disambiguate English and then adding a rule-based chunker for English?

-- 
Xavi Ivars
http://xavi.ivars.me
Re: Chunker interface added
On 2013-08-27 16:44, Xavi Ivars wrote:

> Have you thought on using Apertium resources to disambiguate English and then add a rule based chunker for English?

Actually, OpenNLP also disambiguates, as its first step is POS tagging. I'd just need to find time to try a rule-based approach. What kind of resources does Apertium provide for this? Is it different from running the OpenNLP POS tagger?

Of course, feel free to give it a try if you want. Just let us know so we don't work on the same task without coordination.

Regards
 Daniel

-- 
http://www.danielnaber.de
Re: Chunker interface added
2013/8/27 Daniel Naber list2...@danielnaber.de

> On 2013-08-27 16:44, Xavi Ivars wrote:
>
>> Have you thought on using Apertium resources to disambiguate English and then add a rule based chunker for English?
>
> Actually OpenNLP also disambiguates, as the first step is POS tagging. I'd just need to find time to try a rule-based approach. What kind of resources does Apertium provide for this, is it different from running the OpenNLP POS tagger?

I don't think it's much different, but the POS tagger is language-independent (you only need to train it, and it's already trained for a lot of languages).

There's also some chunking done in Apertium [1][2], entirely rule-based (defined in XML), so you might want to have a look.

[1] http://wiki.apertium.org/wiki/Chunking
[2] http://wiki.apertium.org/wiki/Chunking:_A_full_example

-- 
Xavi Ivars
http://xavi.ivars.me
RE: Chunker interface added
I have some XML rules that I use for POS disambiguation in a term checker. You can take what you want or adapt as necessary from www.simplified-english.co.uk/installation.html. Refer to rulegroup id=POS_DISAMBIGUATION_IDENTIFY_NOUN.

(I am in the process of rewriting the rules to make them more robust and more general. If you want to know when I update the rules, send me an e-mail.)

Regards,

Mike Unwalla
Contact: www.techscribe.co.uk/techw/contact.htm

-----Original Message-----
From: Daniel Naber [mailto:list2...@danielnaber.de]

For English, expressions something like article, any number of adjectives, noun aren't a bad start (but see above - in English finding the noun is not trivial). Alternatively, find an existing chunker component we could add.

Regards
 Daniel
Re: Chunker interface added
On 2013-08-24 16:28, R.J. Baars wrote:

> This is very promising. I would like to know more about this.

Nothing has been decided yet - it will take some time before I have a working version for English, then we'll see how this can be applied to other languages.

Regards
 Daniel

-- 
http://www.danielnaber.de
Re: Chunker interface added
Are you going to build a chunker from scratch or rely on existing technology, e.g. the OpenNLP Chunker [1]?

Cheers,

-- Richard

[1] http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.chunker

On 24.08.2013 at 18:26, Daniel Naber list2...@danielnaber.de wrote:

> On 2013-08-24 16:28, R.J. Baars wrote:
>
>> This is very promising. I would like to know more about this.
>
> Nothing has been decided yet - it will take some time before I have a working version for English, then we'll see how this can be applied to other languages.
>
> Regards
>  Daniel
Re: Chunker interface added
On 2013-08-24 20:28, Richard Eckart de Castilho wrote:

> Are you going to build a chunker from scratch or rely on existing technology, e.g. the OpenNLP Chunker [1]?

I'll use the one from OpenNLP for now. It's kind of a black box for us, so I'm not sure yet how to handle those cases where OpenNLP gets it wrong. Any ideas about that?

Regards
 Daniel

-- 
http://www.danielnaber.de
Re: Chunker interface added
On 24.08.2013 at 21:03, Daniel Naber list2...@danielnaber.de wrote:

> On 2013-08-24 20:28, Richard Eckart de Castilho wrote:
>
>> Are you going to build a chunker from scratch or rely on existing technology, e.g. the OpenNLP Chunker [1]?
>
> I'll use the one from OpenNLP for now. It's kind of a black box for us, so I'm not sure yet how to handle those cases where OpenNLP gets it wrong. Any ideas about that?

I'm not familiar with its details, but given that it can be trained, a good solution would probably be to start building a corpus of the sentences it gets wrong and to retrain every once in a while with the original corpus plus the manually corrected samples.

-- Richard
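The "original corpus plus manually corrected samples" step could be sketched as below, in Python. This only prepares the combined training data; the retraining itself is left to the chunker's own training tool and is not shown. The sketch assumes the data is in a CoNLL-2000 style column format (one token per line as word, POS tag, chunk tag, with a blank line between sentences). The function names and the choice to let a corrected sentence replace an original with the same words are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, ...]   # (word, POS tag, chunk tag)
Sentence = List[Row]


def parse_corpus(text: str) -> List[Sentence]:
    """Parse column-formatted text into sentences (blank line = boundary)."""
    sentences = []
    for block in text.strip().split("\n\n"):
        rows = [tuple(line.split()) for line in block.splitlines() if line.strip()]
        if rows:
            sentences.append(rows)
    return sentences


def words(sentence: Sentence) -> Tuple[str, ...]:
    """Key a sentence by its word sequence, ignoring the tags."""
    return tuple(row[0] for row in sentence)


def merge_corpora(original: List[Sentence], corrected: List[Sentence]) -> List[Sentence]:
    """Combine both corpora; a corrected sentence overrides the original
    sentence with the same words, and new sentences are appended."""
    fixes: Dict[Tuple[str, ...], Sentence] = {words(s): s for s in corrected}
    seen = {words(s) for s in original}
    merged = [fixes.get(words(s), s) for s in original]
    merged.extend(s for key, s in fixes.items() if key not in seen)
    return merged


def serialize(corpus: List[Sentence]) -> str:
    """Write sentences back to the column format."""
    return "\n\n".join("\n".join(" ".join(row) for row in s) for s in corpus) + "\n"


original = parse_corpus("the DT B-NP\nwalk VB B-VP\n\na DT B-NP\nboy NN I-NP\n")
corrected = parse_corpus("the DT B-NP\nwalk NN I-NP\n")
print(serialize(merge_corpora(original, corrected)))
```

Keeping the corrections in a separate file, as here, means the original corpus stays untouched and the merge can be repeated whenever new corrections are collected.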
Chunker interface added
Hi,

I have added a Chunker interface that every language can implement. It works like a tagger, but instead of assigning part-of-speech tags to single words, it assigns chunk (phrase) tags. Typical chunks are noun chunks and verb chunks. Typical noun chunks look like this in English:

  a boy
  the young boy
  the clever young boy
  the wonder boy

Why is this relevant? Because we currently have false alarms for sentences like "There are over 500 college and university chapters." LT only looks at "500 college", and the rule that matches a number followed by a singular noun is triggered. Instead, the rule needs to match a number followed by a singular noun chunk. With a properly working chunker that's possible.

What exists in git is just the interface, but I have an English chunker in a local branch that allows matching noun chunks like this:

  <token chunk="B-NP-singular"/>
  <token chunk="I-NP-singular" max="-1"/>

NP means noun phrase, B means beginning, and I means inside. So B-NP-singular could match 'a' or 'the', while I-NP-singular with max=-1 could match 'young boy'. In other words, although chunks are larger entities that span several words, the chunk tags are assigned to each of the words inside a chunk.

I'll keep you updated about my progress with making the English chunker work properly. Let me know if you have any questions/comments.

Regards
 Daniel

-- 
http://www.danielnaber.de
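To make the false-alarm example concrete, here is a small Python sketch that contrasts a rule looking only at part-of-speech tags with one looking at chunk tags. It is not LanguageTool code. The POS and chunk tags on the sample sentence are hand-assigned for illustration (a real chunker may tag it differently), and the Token type and both rule functions are hypothetical.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    pos: str    # part-of-speech tag (simplified Penn-style names)
    chunk: str  # chunk tag, hand-assigned here for illustration


def naive_rule(tokens: List[Token]) -> List[int]:
    """Flag each number directly followed by a singular noun (POS only)."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i].pos == "CD" and tokens[i + 1].pos == "NN"]


def chunk_rule(tokens: List[Token]) -> List[int]:
    """Flag each number followed by a token inside a singular noun chunk."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i].pos == "CD"
            and tokens[i + 1].chunk.endswith("NP-singular")]


sentence = [
    Token("There", "EX", "O"),
    Token("are", "VBP", "B-VP"),
    Token("over", "IN", "B-NP-plural"),
    Token("500", "CD", "I-NP-plural"),
    Token("college", "NN", "I-NP-plural"),
    Token("and", "CC", "I-NP-plural"),
    Token("university", "NN", "I-NP-plural"),
    Token("chapters", "NNS", "I-NP-plural"),
    Token(".", ".", "O"),
]

print(naive_rule(sentence))  # [3]: false alarm on "500 college"
print(chunk_rule(sentence))  # []: "college" belongs to a plural noun chunk
```

Because every word carries the tag of the chunk it belongs to, the rule can still inspect one token at a time and yet react to the number of the whole phrase.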