Hey Marcin,

this is a great addition, though I have one remark. Besides valency information, other types of information could be useful too (if we are starting to head in this direction). For example, I have rules for Ukrainian that suggest the superlative form of an adjective when "самий" (very) + base form is used. Currently I have the relation between the base form and the comparative/superlative forms encoded in the dictionary, but in general this is higher-level information that should be stored outside of the tag dictionary.
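To illustrate what I mean by keeping this relation outside of the tag dictionary, here is a minimal sketch in Python (not LT code). The `SUPERLATIVES` table and the `suggest_superlative` function are hypothetical names, the two entries are only illustrative, and the check is simplified to the masculine nominative form "самий":

```python
# Hypothetical lookup kept outside the tag dictionary: base form -> superlative.
# The entries below are illustrative only, not taken from any real resource.
SUPERLATIVES = {
    "гарний": "найгарніший",
    "швидкий": "найшвидший",
}


def suggest_superlative(tokens):
    """Return (index, suggestion) pairs for each "самий" + base-form pair.

    The index points at the "самий" token; the suggestion is the superlative
    that could replace the two-token sequence.
    """
    suggestions = []
    for i in range(len(tokens) - 1):
        if tokens[i].lower() == "самий":
            superlative = SUPERLATIVES.get(tokens[i + 1].lower())
            if superlative is not None:
                suggestions.append((i, superlative))
    return suggestions
```

A real rule would also have to handle the other inflected forms of "самий" and agreement with the adjective, which this sketch ignores.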
I am wondering if we could develop a more generic approach for such additional (semantic) information, e.g. split each type of this information into a category and allow generic references in the token/exception, something like this:

<token semantic_info="<semantic_category_name>:<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>

or even as a subelement (I assume semantic information can get pretty long and complicated, so a child element may be the better choice and will make it easy to add new attributes to it later):

<token>
  <semantic category="<semantic_category_name1>" value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
  <semantic category="<semantic_category_name2>" value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
</token>

So in the valency case you described (1st case) it could be:

<token postag="verb">
  <semantic category="valence" value="WHATEVER_STRING"/>
</token>

Thus, if we add other semantic information to LT, we can use it in the rule logic without changing the LT core.

Thanks
Andriy

2016-01-28 7:30 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:

> Hi all,
>
> To allow for better disambiguation and have better rules, I need to
> include a valency dictionary with LT. These are dictionaries that
> specify which grammatical cases or prepositions go with which verbs etc.
> There are such resources for many languages that we support. And using
> these resources, we could enrich POS tag disambiguation a lot (I'm using
> a horribly long regular expression right now instead of a dictionary,
> for example), and write up a lot of important rules.
>
> The obvious choice for representing the dictionary (which is available
> for Polish on a fairly liberal license) is to use a finite-state lexicon
> that we normally use for taggers. The dictionary will be applied after
> tagging because valency dictionary will require POS tag + lexeme
> information.
> In Polish, the entries look like this:
>
> absurdalny: pewny: : : : {prepnp(dla,gen)}
> absurdalny: pewny: : : : {prepnp(w,loc)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
> absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}
>
> But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are
> paragraph-based:
>
> VAL$ abaisser: P0 P1
> VTYPE$ predicator simple
> VERB$ ABAISSER/abaisser
> NUM$ 10
> EG$ il faudra abaisser la persienne
> TR_DU$ laten zakken, neerhalen, neerlaten, doen dalen
> TR_EN$ let down, lower
> FRAME$ subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
> P0$ (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci,
> ceux-ci
> P1$ que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
> RP$ passif être, se passif
> AUX$ avoir
>
>
> VAL$ abaisser: P0 (P1)
> VTYPE$ predicator simple
> VERB$ ABAISSER/abaisser
> NUM$ 20
> EG$ il a raconté cette anecdote pour m'abaisser
> TR_DU$ vernederen, kleineren
> TR_EN$ humiliate
> FRAME$ subj:pron|n:[hum], ?obj:pron|n:[hum]
> P0$ (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci, ceux-ci,
> (ça(de_inf))
> P1$ 0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci,
> l'un l'autre
> RP$ passif être, se faire passif
> AUX$ avoir
>
>
> I would also add a new list of valency attributes to every
> AnalyzedToken, simply as a string value (parsing the string would be
> overkill because there might be different ways of encoding valency
> information for different languages), with appropriate getters and setters.
>
> To use this, we will need some additional attributes in XML elements:
> token and exception.
> The following notation seems to be fairly fine:
>
> <token postag="verb" valence="WHATEVER_STRING"/>
>
> I think matching valence should, by default, use regular expressions.
> I'm not sure if negation is needed.
>
> Alternatively, we could go for attribute-value pairs but the problem
> would be how to make this language-independent and use this in our XML
> notation. The easiest way would be then probably define types and their
> ids just the way I did this for unification. For French, this could look
> like:
>
> <token>abaisser<valency id="AUX">avoir</valency><valency
> id="VTYPE">predicator simple</valency></token>
>
> Of course, the textual values of <valency> could be made into regexes
> and negated.
>
> For this, of course we would need to write up a key-value parser for all
> particular valency dictionaries. But this would definitely speed up
> matching and would make writing rules much easier.
>
> Best
>
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
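To make the <semantic category="..." value="..."/> idea a bit more concrete, here is a minimal sketch in Python (not LT code, and not how AnalyzedToken works today). The `Token` class, the `matches` function and the "adj" POS tag are hypothetical; the valence strings are copied from the Polish entries quoted above. It assumes the POS tag regex must match the whole tag while the value regex may match anywhere in the stored string, which is only one possible convention:

```python
import re
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    postag: str
    # Category name -> list of opaque strings taken from that category's
    # resource (e.g. "valence" -> raw valency frames). Nothing is parsed.
    semantic: dict = field(default_factory=dict)


def matches(token, postag_regex, category, value_regex):
    """True if the POS tag matches and any value in the category matches."""
    if not re.fullmatch(postag_regex, token.postag):
        return False
    values = token.semantic.get(category, [])
    return any(re.search(value_regex, value) for value in values)


# Hypothetical token carrying two of the frames from the Polish example.
tok = Token(
    "absurdalny",
    "adj",
    {"valence": ["{prepnp(dla,gen)}", "{prepnp(w,loc)}"]},
)
```

With this shape, `matches(tok, "adj", "valence", r"prepnp\(dla,gen\)")` succeeds, and a new kind of information is just another key in the map, so the matching code would not need to change when a category is added.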