Hi Marcin, I was actually thinking of something even more abstract. To adjust your example:

<token postag="verb"> <dictionary name="valency" category="noun phrase" tag="accusative"/> </token>
<token postag="verb"> <dictionary name="somethingelse" category="some_category" tag=".*xxx.*" tag_regexp="yes"/> </token>

or

<token postag="verb"> <dictionary name="somethingelse" category="some_category" regexp="yes">.*xxx.*</dictionary> </token>

In this case the LT core would only know that there is a set of extra "dictionary" information that has a category and a value. This way we can add many different types of dictionaries, which could be specific to each language, and the rules for that language will use that language-specific information. The LT core will allow each language to provide this extra information and will not care much about its content.

Regards,
Andriy

2016-02-02 14:40 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
> On 02.02.2016 at 18:08, Andriy Rysin wrote:
>> Hey Marcin,
>>
>> this is a great addition, though I have one remark. Besides valency
>> information, some other types of information could be useful too (if we
>> are starting to head in this direction). E.g. I have rules in Ukrainian
>> that suggest the superlative form for an adjective when "самий" (very) +
>> the base form is used. Currently I have the relation between the base
>> form and the comparative/superlative forms encoded in the dictionary,
>> but in general this is higher-level information that should be stored
>> outside of the tag dictionary.
>
> I would argue that in some languages (at least in Polish and English)
> this is not semantic-level information; it is grammatical, or
> morphosyntactic, information.
>
>>
>> I am wondering if we could develop a more generic approach for such
>> additional (semantic) information, e.g. 
>> split each type of this info into categories and allow generic
>> references in token/exception, something like this:
>>
>> <token semantic_info="<semantic_category_name>:<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
>>
>> or even as a subelement (I assume semantic information can get pretty
>> long/complicated, so a child element may be a better choice and will
>> allow us to add new attributes to it easily later):
>>
>> <token>
>> <semantic category="<semantic_category_name1>" value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
>> <semantic category="<semantic_category_name2>" value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
>> </token>
>>
>> so in the valency case you described (1st case) it could be:
>>
>> <token postag="verb">
>> <semantic category="valence" value="WHATEVER_STRING"/>
>> </token>
>
> Valency is definitely not a semantic category:
>
> https://en.wikipedia.org/wiki/Valency_(linguistics)
>
> But your approach seems quite elegant. I would argue that valency is one
> kind of information that should be treated as key-value pairs:
>
> <token postag="verb">
> <valency category="noun phrase" value="accusative"/>
> </token>
>
> This would match a verb that takes an accusative noun phrase (of course,
> the values would be defined per valency lexicon in a language). There
> are free valency lexicons for many languages besides Polish.
>
>>
>> Thus, if we add other semantic information to LT, we can use this info
>> in the logic without changing the LT core.
>
> The core XML parsing will have to be changed anyway.
>
> Best,
> Marcin
>
>>
>> Thanks
>> Andriy
>>
>> 2016-01-28 7:30 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>>> Hi all,
>>>
>>> To allow for better disambiguation and have better rules, I need to
>>> include a valency dictionary with LT. These are dictionaries that
>>> specify which grammatical cases or prepositions go with which verbs etc.
>>> There are such resources for many languages that we support. 
>>> And using these resources, we could enrich POS tag disambiguation a lot
>>> (I'm using a horribly long regular expression right now instead of a
>>> dictionary, for example), and write up a lot of important rules.
>>>
>>> The obvious choice for representing the dictionary (which is available
>>> for Polish under a fairly liberal license) is to use a finite-state
>>> lexicon of the kind we normally use for taggers. The dictionary will be
>>> applied after tagging, because the valency dictionary requires POS tag +
>>> lexeme information. In Polish, the entries look like this:
>>>
>>> absurdalny: pewny: : : : {prepnp(dla,gen)}
>>> absurdalny: pewny: : : : {prepnp(w,loc)}
>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
>>> absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}
>>>
>>> But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are
>>> paragraph-based:
>>>
>>> VAL$ abaisser: P0 P1
>>> VTYPE$ predicator simple
>>> VERB$ ABAISSER/abaisser
>>> NUM$ 10
>>> EG$ il faudra abaisser la persienne
>>> TR_DU$ laten zakken, neerhalen, neerlaten, doen dalen
>>> TR_EN$ let down, lower
>>> FRAME$ subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
>>> P0$ (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci,
>>> ceux-ci
>>> P1$ que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
>>> RP$ passif être, se passif
>>> AUX$ avoir
>>>
>>>
>>> VAL$ abaisser: P0 (P1)
>>> VTYPE$ predicator simple
>>> VERB$ ABAISSER/abaisser
>>> NUM$ 20
>>> EG$ il a raconté cette anecdote pour m'abaisser
>>> TR_DU$ vernederen, kleineren
>>> TR_EN$ humiliate
>>> FRAME$ subj:pron|n:[hum], ?obj:pron|n:[hum]
>>> P0$ (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci,
>>> ceux-ci, (ça(de_inf))
>>> P1$ 0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci,
>>> l'un l'autre
>>> RP$ passif être, se faire passif
>>> AUX$ avoir
>>>
>>>
>>> I would also add a new list of valency attributes to every
>>> AnalyzedToken, simply as a string value (parsing the string would be
>>> overkill, because there might be different ways of encoding valency
>>> information for different languages), with appropriate getters and setters.
>>>
>>> To use this, we will need some additional attributes in the XML elements
>>> token and exception. The following notation seems fairly fine:
>>>
>>> <token postag="verb" valence="WHATEVER_STRING"/>
>>>
>>> I think matching valence should, by default, use regular expressions.
>>> I'm not sure if negation is needed.
>>>
>>> Alternatively, we could go for attribute-value pairs, but the problem
>>> would be how to make this language-independent and use it in our XML
>>> notation. The easiest way would then probably be to define types and
>>> their ids, just the way I did this for unification. For French, this
>>> could look like:
>>>
>>> <token>abaisser<valency id="AUX">avoir</valency><valency
>>> id="VTYPE">predicator simple</valency></token>
>>>
>>> Of course, the textual values of <valency> could be made into regexes
>>> and negated.
>>>
>>> For this, of course, we would need to write a key-value parser for each
>>> particular valency dictionary. But this would definitely speed up
>>> matching and would make writing rules much easier.
>>>
>>> Best
>>>
>>> ------------------------------------------------------------------------------
>>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>>> Monitor end-to-end web transactions and take corrective actions now
>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>> http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
>>> _______________________________________________
>>> Languagetool-devel mailing list
>>> Languagetool-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
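To make the key-value idea in the thread concrete, here is a minimal sketch of parsing both entry formats and of the regex matching the proposed `<token valence="..."/>` attribute would do by default. The entry strings come from Marcin's examples above; the function names and the exact field layout assumed here are illustrative, not actual LanguageTool code.

```python
import re

def parse_polish_entry(line):
    """Split one colon-separated Polish valency entry into the lemma and
    its frame requirements, e.g.
    "absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}"."""
    fields = [f.strip() for f in line.split(":")]
    lemma = fields[0]
    # The last field holds the frame; each {...} group is one requirement.
    requirements = re.findall(r"\{([^}]*)\}", fields[-1])
    return lemma, requirements

def parse_dicovalence_entry(paragraph):
    """Turn one paragraph-based French DICOVALENCE entry into a dict,
    keyed by the field labels before the '$' sign (VAL, VTYPE, AUX, ...)."""
    entry = {}
    for line in paragraph.strip().splitlines():
        key, _, value = line.partition("$")
        entry[key] = value.strip()
    return entry

def valency_matches(values, pattern):
    """Regex matching over the raw valency strings, as the proposed
    <token valence="..."/> attribute would do by default."""
    return any(re.search(pattern, v) for v in values)

lemma, reqs = parse_polish_entry(
    "absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}")
entry = parse_dicovalence_entry(
    "VAL$ abaisser: P0 P1\nVTYPE$ predicator simple\nAUX$ avoir")
print(lemma, reqs)   # absurdalny ['prepnp(dla,gen)', 'cp(gdy)']
print(entry["AUX"])  # avoir
print(valency_matches(reqs, r"prepnp\(dla,gen\)"))  # True
```

The id-based variant (`<valency id="AUX">avoir</valency>`) would then be a lookup in the parsed dict followed by the same regex match, so both notations can share one matching routine.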