Hi all,

To improve disambiguation and make better rules possible, I would like to 
include a valency dictionary in LT. A valency dictionary specifies which 
grammatical cases or prepositions go with which verbs etc. Such resources 
exist for many of the languages we support. Using them, we could improve 
POS tag disambiguation a lot (right now I'm using a horribly long regular 
expression instead of a dictionary, for example) and write many important 
rules.

The obvious choice for representing the dictionary (which is available 
for Polish under a fairly liberal license) is the finite-state lexicon 
format that we normally use for taggers. The dictionary would be applied 
after tagging, because a valency lookup requires POS tag + lexeme 
information. In Polish, the entries look like this:

absurdalny: pewny: : : : {prepnp(dla,gen)}
absurdalny: pewny: : : : {prepnp(w,loc)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}
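
To make the line-based format concrete, here is a minimal Java sketch of how 
one such line could be split. It assumes that every line has exactly six 
colon-separated fields, that the first field is the lemma, that the last one 
is the frame, and that frame positions are joined with "+". The four middle 
fields are kept uninterpreted. The class name LineValencyEntry is 
hypothetical; nothing like it exists in LT.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one line of a line-based valency dictionary. */
class LineValencyEntry {
    final String lemma;
    final List<String> middleFields; // kept as-is, meaning not interpreted here
    final List<String> positions;    // frame split on '+'

    LineValencyEntry(String lemma, List<String> middleFields, List<String> positions) {
        this.lemma = lemma;
        this.middleFields = middleFields;
        this.positions = positions;
    }

    /** Splits one line into six fields (assumption: always exactly six). */
    static LineValencyEntry parse(String line) {
        // Limit 6 keeps any further colons inside the last (frame) field.
        String[] parts = line.split(":", 6);
        if (parts.length != 6) {
            throw new IllegalArgumentException("Expected 6 fields: " + line);
        }
        List<String> middle = new ArrayList<>();
        for (int i = 1; i < 5; i++) {
            middle.add(parts[i].trim());
        }
        List<String> positions = new ArrayList<>();
        for (String p : parts[5].trim().split("\\+")) {
            positions.add(p.trim());
        }
        return new LineValencyEntry(parts[0].trim(), middle, positions);
    }
}
```

With this, a line such as "absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}" 
yields the lemma "absurdalny" and two frame positions.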

But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are 
paragraph-based:

VAL$    abaisser: P0 P1
VTYPE$  predicator simple
VERB$   ABAISSER/abaisser
NUM$    10
EG$     il faudra abaisser la persienne
TR_DU$  laten zakken, neerhalen, neerlaten, doen dalen
TR_EN$  let down, lower
FRAME$  subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
P0$     (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci, ceux-ci
P1$     que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
RP$     passif être, se passif
AUX$    avoir


VAL$    abaisser: P0 (P1)
VTYPE$  predicator simple
VERB$   ABAISSER/abaisser
NUM$    20
EG$     il a raconté cette anecdote pour m'abaisser
TR_DU$  vernederen, kleineren
TR_EN$  humiliate
FRAME$  subj:pron|n:[hum], ?obj:pron|n:[hum]
P0$     (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci, ceux-ci, (ça(de_inf))
P1$     0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci, l'un l'autre
RP$     passif être, se faire passif
AUX$    avoir


I would also add a new list of valency attributes to every 
AnalyzedToken, stored simply as string values with appropriate getters 
and setters. Parsing the strings would be overkill, because different 
languages may encode valency information in different ways.
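
A rough sketch of what I mean, in Java. ValencyToken below is a hypothetical 
stand-in, not the real AnalyzedToken, and the method names are only 
suggestions. The valency values are stored as opaque strings and never parsed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for AnalyzedToken with a list of valency strings. */
class ValencyToken {
    private final String token;
    private final String posTag;
    private final String lemma;
    // Opaque, language-specific valency strings; deliberately left unparsed.
    private final List<String> valencies = new ArrayList<>();

    ValencyToken(String token, String posTag, String lemma) {
        this.token = token;
        this.posTag = posTag;
        this.lemma = lemma;
    }

    String getToken() { return token; }
    String getPOSTag() { return posTag; }
    String getLemma() { return lemma; }

    /** Returns a read-only view of the valency strings. */
    List<String> getValencies() {
        return Collections.unmodifiableList(valencies);
    }

    /** Replaces all valency strings with the given ones. */
    void setValencies(List<String> newValencies) {
        valencies.clear();
        valencies.addAll(newValencies);
    }

    /** Adds a single valency string, e.g. one dictionary entry. */
    void addValency(String valency) {
        valencies.add(valency);
    }
}
```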

To use this, we will need an additional attribute on two XML elements: 
token and exception. The following notation seems reasonable:

<token postag="verb" valence="WHATEVER_STRING"/>

I think matching valence should, by default, use regular expressions. 
I'm not sure if negation is needed.
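
As an illustration, regex matching against the valency strings could work 
roughly as in the Java sketch below. ValenceMatcher is a hypothetical name, 
and I am assuming here that the pattern has to match a whole valency string 
and that a token matches if any one of its strings does. Both choices are 
open to discussion.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical matcher for the proposed valence="..." attribute. */
class ValenceMatcher {
    private final Pattern pattern;

    ValenceMatcher(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** True if at least one valency string matches the whole pattern. */
    boolean matchesAny(List<String> valencies) {
        for (String valency : valencies) {
            if (pattern.matcher(valency).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

A rule looking for a "gdy" clause would then use something like 
new ValenceMatcher(".*cp\\(gdy\\).*"). Note that the parentheses in the 
dictionary notation have to be escaped in the regex.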

Alternatively, we could go for attribute-value pairs, but the problem 
would be how to make this language-independent and how to express it in 
our XML notation. The easiest way would then probably be to define types 
and their ids, just as I did for unification. For French, this could look 
like:

<token>abaisser<valency id="AUX">avoir</valency><valency id="VTYPE">predicator simple</valency></token>

Of course, the textual values of <valency> could be made into regexes 
and negated.

For this, of course, we would need to write a key-value parser for each 
valency dictionary format. But this would definitely speed up matching 
and would make writing rules much easier.
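
For the paragraph-based French format, such a parser could look roughly like 
the Java sketch below. It assumes that entries are separated by blank lines, 
that a field line starts with an upper-case key followed by "$", and that a 
line without such a key continues the previous field (as with the wrapped P0$ 
and P1$ values). ParagraphValencyParser is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical key-value parser for paragraph-based valency entries. */
class ParagraphValencyParser {
    // A field line looks like "AUX$    avoir": key, '$', whitespace, value.
    private static final Pattern FIELD =
        Pattern.compile("^([A-Z0-9_]+)\\$\\s*(.*)$");

    /** Returns one key-value map per blank-line-separated entry. */
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> entries = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        String lastKey = null;
        for (String rawLine : text.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                // A blank line closes the current entry, if there is one.
                if (!current.isEmpty()) {
                    entries.add(current);
                    current = new LinkedHashMap<>();
                }
                lastKey = null;
                continue;
            }
            Matcher m = FIELD.matcher(line);
            if (m.matches()) {
                lastKey = m.group(1);
                current.put(lastKey, m.group(2).trim());
            } else if (lastKey != null) {
                // Continuation line: append to the previous field's value.
                current.put(lastKey, current.get(lastKey) + " " + line);
            }
        }
        if (!current.isEmpty()) {
            entries.add(current);
        }
        return entries;
    }
}
```

Each resulting map could then back the <valency id="..."> lookups directly, 
with the key (AUX, VTYPE, ...) serving as the id.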

Best
