On 2014-03-26 13:51, Daniel Naber wrote:
> On 2014-03-25 14:24, Marcin Miłkowski wrote:
>
>>> So instead of just adding the POS tag we get from Morfologik to our
>>> AnalyzedToken object as a string, we interpret it and store something
>>> like pos = preposition, case = accusative. Is that what you mean?
>>
>> Exactly.
>
> Any ideas on how the VBP tag (in English) might fit into this approach,
> i.e. "not 3rd person singular"? Will we need to introduce a tag like
> "pos = Not3rdPsSgVerb"? That doesn't seem elegant but keeps it short.

No, that would be horrible, because it would not be an improvement. The
problem is not that the tags are cryptic and short; it is that they do not
make the individual features available separately.

My use case for readable POS tags is also speed and simplicity for
unification (rules that use agreement between words). It is simply faster
to specify features by naming the appropriate attributes, which can be
processed once, than to run a regexp every time a sentence is processed in
a unification rule. For Catalan, Polish, and French this will be a huge
time improvement.

Now, for this to work, the attributes should be specified just as they are
in Corpus Query Language (CQL).

>
> Internally, it could be expanded to mean:
> [{pos=verb, person=1|2, number=singular, tense=present},
> {pos=verb, person=1|2|3, number=plural, tense=present}]

So for a word tagged as VBP we could have

    <token pos="verb" person="1|2" number="sg" en:tense="present"/>

or

    <token pos="verb" person="3" number="pl" en:tense="present"/>

Both would match a word tagged VBP. (Note that the disambiguator could
even remove one of the interpretations to make it clear that this is a
plural use of the token!)

Above, I used a mixture of attributes without namespaces (these would be
universal for all languages) and attributes with namespaces, like tense,
which is not present in all languages. We can look at the proposed
Universal Tagset to find universal categories:

https://code.google.com/p/universal-pos-tags/

Note also that one could write:

    <token pos="verb"/>

And this would be equivalent to:

    <token postag="VB.*" postag_regexp="yes"/>

But possibly a lot faster.

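To make the "processed once" point concrete, here is a minimal Python sketch. It is not LanguageTool code: the TAG_FEATURES table and the expand and matches helpers are hypothetical names, only the VBP entry follows the expansion quoted above, and the JJR entry is an assumed illustration of degree="com". Each tag is interpreted into feature readings once, and a token pattern is then checked attribute by attribute, with no regexp run against the tag string.

```python
# Hypothetical table: each tag is interpreted ONCE into a list of readings.
# A reading maps an attribute name to the set of values it allows.
TAG_FEATURES = {
    # VBP = present tense, not 3rd person singular (expansion quoted above).
    "VBP": [
        {"pos": {"verb"}, "person": {"1", "2"},
         "number": {"sg"}, "en:tense": {"present"}},
        {"pos": {"verb"}, "person": {"1", "2", "3"},
         "number": {"pl"}, "en:tense": {"present"}},
    ],
    # Assumed entry for a comparative adjective, for illustration only.
    "JJR": [{"pos": {"adj"}, "degree": {"com"}}],
}


def expand(tag):
    """Return the feature readings for a tag (empty list if unknown)."""
    return TAG_FEATURES.get(tag, [])


def matches(pattern, readings):
    """Check a token pattern such as {"pos": "verb", "person": "1|2"}.

    The pattern matches if at least one reading has every requested
    attribute and shares at least one value with each "a|b" alternative.
    """
    wanted = {attr: set(value.split("|")) for attr, value in pattern.items()}
    return any(
        all(attr in reading and wanted[attr] & reading[attr] for attr in wanted)
        for reading in readings
    )
```

With this, matches({"pos": "verb"}, expand("VBP")) holds, which is the attribute counterpart of the regexp VB.* matching the string VBP, while a pattern asking for person="3" together with number="sg" is rejected by both readings.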
The new syntax is also much easier to read, and is equivalent to the CQL
query:

    [pos="verb"]

Similarly for words in the comparative degree, where for English you
currently have to write:

    <token postag="..R" postag_regexp="yes"/>

You could simply say:

    <token degree="com"/>

Basically, by making the attributes separate we would have a much easier
way to write complex rules, without the problem of how to specify POS
tags. I consider myself a power user, but with the complex Polish tagset
it is sometimes really difficult to specify the features I want using
regexes: the tagset itself creates pretty complex and lengthy strings, and
a lot of time is needed to make sure that the regex matches.

Regards,
Marcin

>
> Regards
> Daniel
>
>
> ------------------------------------------------------------------------------
> Learn Graph Databases - Download FREE O'Reilly Book
> "Graph Databases" is the definitive new guide to graph databases and their
> applications. Written by three acclaimed leaders in the field,
> this first edition is now available. Download your free book today!
> http://p.sf.net/sfu/13534_NeoTech
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>