On 2013-03-01 11:27, Daniel Naber wrote:
> Hi,
>
> one of the significant sources of false alarms in English is the fact that
> LT doesn't properly handle phrases. For example:
>
> "There are several cargo and passenger ferries."
>
> leads to an error because only "several cargo" is considered and LT
> requires "several" to be followed by a plural noun. Instead, "cargo and
> passenger ferries" should be considered one plural noun phrase.
>
> Has anybody an idea how practical it would be to find these phrases with
> disambiguation rules? One could do this (just an example, it doesn't fully
> cover the example above):
>
>      <rule id="NNPS_PHRASE1" name="plural noun phrase">
>          <pattern>
>              <marker>
>                  <token postag="NN"></token>
>              </marker>
>              <token postag="NNS"></token>
>          </pattern>
>          <disambig action="add"><wd pos="NNPS_PHRASE_START"/></disambig>
>      </rule>
>
> Then the rules that now look for plural nouns would have to be changed to
> look for NNPS_PHRASE_START.
>
> Is there a way to get "longest match" with disambiguation rules? It seems
> to me it's at least difficult to remove shorter phrases inside longer
> phrases.
>
> Any ideas or actual rules for this are very welcome. I think this is one of
> the remaining major problems for English (and actually not only English).

Actually, it's possible to build NNP definitions as a cascade of 
disambiguation rules.

There was a successful attempt to do so using Spejd (a shallow parser
for Polish), and Constraint Grammar does it as well.

The trick is, I believe, to start with the longest sequences of nouns and
then look for shorter ones. One can hardly find a sequence of nouns and
adjectives longer than 9 elements in real language.
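
To make the longest-first idea concrete, here is a minimal sketch in Python rather than in LT's rule syntax. It is not how LT, Spejd or Constraint Grammar implement it; the function name, the tag sets and the tagging of the sample sentence are assumptions made for illustration. It tries windows of 9 tokens first, then shorter ones, and never reopens tokens already claimed by a longer phrase, so shorter phrases inside longer ones are never emitted.

```python
# Hypothetical sketch of longest-first phrase matching over POS-tagged tokens.
MAX_LEN = 9  # upper bound on phrase length, taken from the estimate above

# Assumed tag sets, chosen only for this illustration.
PHRASE_TAGS = {"NN", "NNS", "JJ", "CC"}  # tags allowed inside a phrase
HEAD_TAGS = {"NNS"}                      # a plural phrase ends in a plural noun


def find_plural_phrases(tagged):
    """Return (start, end) spans, end exclusive, trying the longest window first."""
    claimed = [False] * len(tagged)
    spans = []
    for length in range(min(MAX_LEN, len(tagged)), 1, -1):
        for start in range(len(tagged) - length + 1):
            end = start + length
            if any(claimed[start:end]):
                continue  # part of a longer phrase found earlier
            tags = [tag for _, tag in tagged[start:end]]
            if (all(t in PHRASE_TAGS for t in tags)
                    and tags[0] != "CC"
                    and tags[-1] in HEAD_TAGS):
                spans.append((start, end))
                for i in range(start, end):
                    claimed[i] = True
    return sorted(spans)


sentence = [("There", "EX"), ("are", "VBP"), ("several", "JJ"),
            ("cargo", "NN"), ("and", "CC"), ("passenger", "NN"),
            ("ferries", "NNS"), (".", ".")]
spans = find_plural_phrases(sentence)
phrases = [" ".join(word for word, _ in sentence[s:e]) for s, e in spans]
```

With this tagging, "several" is absorbed into the phrase because adjectives are allowed inside it; a real rule set would have to decide whether quantifiers belong to the phrase or precede it.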

But before we do so, we need a lot more disambiguation between verbs
and nouns. We could use Brill tagger rules as inspiration, or simply
train the Brill tagger on a real corpus (one annotated with the tags
that actually occur rather than the "expected" ones: the Brown corpus
sometimes has "he do" as PRP VBZ instead of PRP VBP). I've written this
one up on the ideas page.
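
For anyone unfamiliar with Brill-style rules, here is a rough sketch of what one contextual rule looks like when applied. The rule shown (retag NN as VB after TO) is the textbook example; the class and function names are made up for this illustration and are not part of any existing tagger's API.

```python
# Hypothetical sketch of applying Brill-style contextual transformation rules.
from typing import NamedTuple


class ContextRule(NamedTuple):
    from_tag: str  # tag to be replaced
    to_tag: str    # replacement tag
    prev_tag: str  # the rule fires only if the previous token has this tag


def apply_rules(tagged, rules):
    """Apply each rule in order, scanning the whole sentence left to right."""
    result = list(tagged)
    for rule in rules:
        for i in range(1, len(result)):
            word, tag = result[i]
            if tag == rule.from_tag and result[i - 1][1] == rule.prev_tag:
                result[i] = (word, rule.to_tag)
    return result


# Textbook rule: a word tagged as a noun right after "to" is retagged as a verb.
RULES = [ContextRule(from_tag="NN", to_tag="VB", prev_tag="TO")]

retagged = apply_rules([("to", "TO"), ("run", "NN")], RULES)
```

A trained tagger would learn an ordered list of such rules from the corpus; here the single rule is written by hand.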

Best,
Marcin

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
