Hey Marcin

This is a great addition, though I have one remark. Besides valency
information, other types of information could be useful too (if we are
starting to head in this direction). E.g. I have rules in Ukrainian that
suggest the superlative form of an adjective when "самий" (very) + base
form is used. Currently I have the relation between the base form and
the comparative/superlative forms encoded in the dictionary, but in
general this is higher-level information that should be stored outside
of the tag dictionary.

I am wondering if we could develop a more generic approach for such
additional (semantic) information, e.g. split each type of this info
into its own category and allow generic references in the
token/exception, something like this:

<token semantic_info="<semantic_category_name>:<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>

or even as a subelement (I assume semantic information can get pretty
long/complicated, so a child element may be a better choice and would
make it easy to add new attributes to it later):

<token>
  <semantic category="<semantic_category_name1>" value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
  <semantic category="<semantic_category_name2>" value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
</token>

So in the valency case you described (the first case), it could be:

<token postag="verb">
  <semantic category="valence" value="WHATEVER_STRING"/>
</token>

Thus, if we later add other kinds of semantic information to LT, we can
use that info in the logic without changing the LT core.
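
To make the idea a bit more concrete, here is a rough sketch in Python
(LT itself is Java, so this only illustrates the logic). The names
Token, add_semantic and matches are made up for this example and are
not existing LT API. I am also assuming that values are opaque strings
matched as regular expressions, as Marcin suggested for valence.

```python
import re
from collections import defaultdict


class Token:
    """Hypothetical stand-in for an analyzed token (not LT's AnalyzedToken)."""

    def __init__(self, lemma, postag):
        self.lemma = lemma
        self.postag = postag
        # category name -> list of opaque string values from that category
        self.semantic = defaultdict(list)

    def add_semantic(self, category, value):
        self.semantic[category].append(value)


def matches(token, category, pattern):
    """Return True if any value stored under `category` fully matches `pattern`."""
    regex = re.compile(pattern)
    # .get() avoids creating an empty entry for an unknown category
    return any(regex.fullmatch(v) for v in token.semantic.get(category, []))


# Values copied from the Polish entries quoted below
tok = Token("absurdalny", "adj")
tok.add_semantic("valence", "{prepnp(dla,gen)}+{cp(że)}")
tok.add_semantic("valence", "{prepnp(w,loc)}")

print(matches(tok, "valence", r".*prepnp\(dla,gen\).*"))  # True
print(matches(tok, "valence", r".*cp\(int\).*"))          # False
print(matches(tok, "degree", r".*"))                      # False, no such category
```

The matcher never needs to know what a category means, so a new
category (valence, degree of comparison, ...) is only new data.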

Thanks
Andriy

2016-01-28 7:30 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
> Hi all,
>
> To allow for better disambiguation and have better rules, I need to
> include a valency dictionary with LT. These are dictionaries that
> specify which grammatical cases or prepositions go with which verbs etc.
> There are such resources for many languages that we support. And using
> these resources, we could enrich POS tag disambiguation a lot (I'm using
> a horribly long regular expression right now instead of a dictionary,
> for example), and write up a lot of important rules.
>
> The obvious choice for representing the dictionary (which is available
> for Polish under a fairly liberal license) is to use a finite-state
> lexicon that we normally use for taggers. The dictionary will be applied
> after tagging because the valency dictionary will require POS tag +
> lexeme information. In Polish, the entries look like this:
>
> absurdalny: pewny: : : : {prepnp(dla,gen)}
> absurdalny: pewny: : : : {prepnp(w,loc)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
> absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}
>
> But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are
> paragraph-based:
>
> VAL$    abaisser: P0 P1
> VTYPE$  predicator simple
> VERB$   ABAISSER/abaisser
> NUM$    10
> EG$     il faudra abaisser la persienne
> TR_DU$  laten zakken, neerhalen, neerlaten, doen dalen
> TR_EN$  let down, lower
> FRAME$  subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
> P0$     (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci, ceux-ci
> P1$     que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
> RP$     passif être, se passif
> AUX$    avoir
>
>
> VAL$    abaisser: P0 (P1)
> VTYPE$  predicator simple
> VERB$   ABAISSER/abaisser
> NUM$    20
> EG$     il a raconté cette anecdote pour m'abaisser
> TR_DU$  vernederen, kleineren
> TR_EN$  humiliate
> FRAME$  subj:pron|n:[hum], ?obj:pron|n:[hum]
> P0$     (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci, ceux-ci, (ça(de_inf))
> P1$     0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci, l'un l'autre
> RP$     passif être, se faire passif
> AUX$    avoir
>
>
> I would also add a new list of valency attributes to every
> AnalyzedToken, simply as a string value (parsing the string would be
> overkill because there might be different ways of encoding valency
> information for different languages), with appropriate getters and setters.
>
> To use this, we will need some additional attributes in XML elements:
> token and exception. The following notation seems to be fairly fine:
>
> <token postag="verb" valence="WHATEVER_STRING"/>
>
> I think matching valence should, by default, use regular expressions.
> I'm not sure if negation is needed.
>
> Alternatively, we could go for attribute-value pairs, but the problem
> would be how to make this language-independent and use it in our XML
> notation. The easiest way would then probably be to define types and
> their ids just the way I did for unification. For French, this could
> look like:
>
> <token>abaisser<valency id="AUX">avoir</valency><valency id="VTYPE">predicator simple</valency></token>
>
> Of course, the textual values of <valency> could be made into regexes
> and negated.
>
> For this, of course, we would need to write a key-value parser for each
> particular valency dictionary. But this would definitely speed up
> matching and would make writing rules much easier.
>
> Best
>
> ------------------------------------------------------------------------------
> Site24x7 APM Insight: Get Deep Visibility into Application Performance
> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
> Monitor end-to-end web transactions and take corrective actions now
> Troubleshoot faster and improve end-user experience. Signup Now!
> http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
> _______________________________________________
> Languagetool-devel mailing list
> Languagetool-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
