Hi Andriy,

On 07.02.2016 at 21:35, Andriy Rysin wrote:
> Hi Marcin
>
> I was actually thinking for something even more abstract. To adjust
> your example:
> <token postag="verb">
>       <dictionary="valency" category="noun phrase" tag="accusative"/>
> </token>
>
> <token postag="verb">
>       <dictionary="somethingelse" category="some_category"
> tag=".*xxx.*" tag_regexp="yes"/>
> </token>
>
> or
>
> <token postag="verb">
>       <dictionary="somethingelse" category="some_category"
> regexp="yes">.*xxx.*</dictionary>
> </token>

I can see what you mean. This is a generalized approach, and I like it, 
but to make it really valid XML we cannot use 
<dictionary="somethingelse">, as an element name cannot double as an 
attribute. We would have to use:

<dictionary type="valency".../>

or

<dictionary id="valency" .../>

I think I prefer the latter approach. LT core would simply check its 
list of dictionaries for one with the given ID and, if found, apply it. 
If not, it would fail (silently or not; we may think of different 
scenarios).
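
The lookup described above could be sketched like this (a minimal sketch; all names are hypothetical, none of this is existing LT API):

```python
# Hypothetical sketch of resolving a <dictionary id="..."/> reference in LT core.
# Class and method names are invented for illustration.

class DictionaryRegistry:
    def __init__(self):
        self._dicts = {}

    def register(self, dict_id, dictionary):
        """Let a language module register an extra dictionary under an ID."""
        self._dicts[dict_id] = dictionary

    def lookup(self, dict_id, fail_silently=True):
        """Return the dictionary for the given ID, or fail (silently or not)."""
        if dict_id in self._dicts:
            return self._dicts[dict_id]
        if fail_silently:
            return None
        raise KeyError("No dictionary with id '%s'" % dict_id)
```

A rule referencing <dictionary id="valency" .../> would then resolve through a single registry call, and the failure mode stays configurable.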
>
>
> In this case LT core would only know that there's a set of extra
> "dictionary" information that has a category and a value. This way we
> can add many different types of dictionaries that could be specific to
> each language, and the rules for that language will use that
> language-specific information.
> Core LT will allow each language to provide this extra information and
> will not care much about its content.

Definitely, this is a good idea to keep in mind. All language 
maintainers could then add their own preferred dictionaries (like a 
collocation dictionary, or a synonym dictionary for that matter).

These resources should also be available for querying when creating 
suggestions, shouldn't they?

Best,
Marcin

>
> Regards,
> Andriy
>
> 2016-02-02 14:40 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>> On 02.02.2016 at 18:08, Andriy Rysin wrote:
>>> Hey Marcin
>>>
>>> this is a great addition, though I have one remark. Besides valency
>>> information, some other types of information could be useful too (if we
>>> are starting to head in this direction). E.g. I have rules in Ukrainian
>>> that suggest the superlative form of an adjective when "самий" (very) +
>>> base form is used. Currently I have the relation between the base form
>>> and the comparative/superlative forms encoded in the dictionary, but in
>>> general this is higher-level information that should be stored outside
>>> of the tag dictionary.
>>
>> I would argue that in some languages (at least in Polish and English)
>> this is not semantic-level information; it is grammatical, or
>> morphosyntactic, information.
>>
>>>
>>> I am wondering if we could develop a more generic approach for such
>>> additional (semantic) information, e.g. split each type of this info
>>> into a category and allow generic references in the token/exception,
>>> something like this:
>>>
>>> <token 
>>> semantic_info="<semantic_category_name>:<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
>>>
>>> or even as a subelement (I assume semantic information can get pretty
>>> long/complicated, so a child element may be a better choice and will
>>> make it easy to add new attributes later)
>>>
>>> <token>
>>> <semantic category="<semantic_category_name1>"
>>> value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
>>> <semantic category="<semantic_category_name2>"
>>> value="<WHATEVER_STRING_FROM_THAT_CATEGORY>"/>
>>> </token>
>>>
>>> so in valency case you described (1st case) it could be:
>>>
>>> <token postag="verb">
>>>     <semantic category="valence" value="WHATEVER_STRING"/>
>>> </token>
>>
>> Valency is definitely not a semantic category:
>>
>> https://en.wikipedia.org/wiki/Valency_(linguistics)
>>
>> But your approach seems quite elegant. I would argue that valency is one
>> kind of information that should be treated as key-value pairs:
>>
>> <token postag="verb">
>>       <valency category="noun phrase" value="accusative"/>
>> </token>
>>
>> This would match a verb that takes an accusative noun phrase (of course,
>> the values would be defined per valency lexicon in a language). There
>> are free valency lexicons for many languages besides Polish.
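
The intended matching semantics of that <valency category="..." value="..."/> element could be sketched as follows (the frame representation is invented purely for illustration; the real values would come from the per-language valency lexicon):

```python
# Hypothetical sketch of matching <valency category="noun phrase" value="accusative"/>
# against a token's valency frames. Each frame is modeled here as a dict
# mapping a category name to its value; this representation is illustrative only.

def token_matches_valency(token_frames, category, value):
    """True if any valency frame of the token has the given category/value pair."""
    return any(frame.get(category) == value for frame in token_frames)
```

So a verb carrying a frame with {"noun phrase": "accusative"} would match, and one without it would not.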
>>
>>>
>>> Thus if we add other semantic information into LT we can use this info
>>> in the logic without changing the LT core.
>>
>> The core XML parsing will have to be changed anyway.
>>
>> Best,
>> Marcin
>>
>>>
>>> Thanks
>>> Andriy
>>>
>>> 2016-01-28 7:30 GMT-05:00 Marcin Miłkowski <list-addr...@wp.pl>:
>>>> Hi all,
>>>>
>>>> To allow for better disambiguation and have better rules, I need to
>>>> include a valency dictionary with LT. These are dictionaries that
>>>> specify which grammatical cases or prepositions go with which verbs etc.
>>>> There are such resources for many languages that we support. And using
>>>> these resources, we could enrich POS tag disambiguation a lot (I'm using
>>>> a horribly long regular expression right now instead of a dictionary,
>>>> for example), and write up a lot of important rules.
>>>>
>>>> The obvious choice for representing the dictionary (which is available
>>>> for Polish under a fairly liberal license) is to use a finite-state
>>>> lexicon of the kind we normally use for taggers. The dictionary will be
>>>> applied after tagging because the valency dictionary requires POS tag +
>>>> lexeme information. In Polish, the entries look like this:
>>>>
>>>> absurdalny: pewny: : : : {prepnp(dla,gen)}
>>>> absurdalny: pewny: : : : {prepnp(w,loc)}
>>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(gdy)}
>>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(int)}
>>>> absurdalny: potoczny: : pred: : {prepnp(dla,gen)}+{cp(jak)}
>>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(jeśli)}
>>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(kiedy)}
>>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(że)}
>>>> absurdalny: pewny: : pred: : {prepnp(dla,gen)}+{cp(żeby)}
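
For what it's worth, the Polish format above looks easy to parse; a minimal sketch, assuming colon-separated fields with the lemma first and '+'-joined frames last (the meaning of the middle fields is not spelled out here, so the sketch leaves them alone):

```python
# Hypothetical sketch: parsing one colon-separated entry of the Polish valency
# dictionary quoted above. Only the lemma (first field) and the frame list
# (last field, frames joined with '+') are interpreted; the rest is a guess.

def parse_polish_entry(line):
    fields = [f.strip() for f in line.split(":")]
    lemma = fields[0]
    frames = fields[-1].split("+")  # e.g. "{prepnp(dla,gen)}+{cp(gdy)}"
    return lemma, frames
```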
>>>>
>>>> But for French (see http://bach.arts.kuleuven.be/dicovalence/) they are
>>>> paragraph-based:
>>>>
>>>> VAL$    abaisser: P0 P1
>>>> VTYPE$  predicator simple
>>>> VERB$   ABAISSER/abaisser
>>>> NUM$    10
>>>> EG$     il faudra abaisser la persienne
>>>> TR_DU$  laten zakken, neerhalen, neerlaten, doen dalen
>>>> TR_EN$  let down, lower
>>>> FRAME$  subj:pron|n:[hum], obj:pron|n:[nhum,?abs]
>>>> P0$     (que), qui, je, nous, elle, il, ils, on, (ça), (ceci), celui-ci, 
>>>> ceux-ci
>>>> P1$     que, la, le, les, en Q, ça, ceci, celui-ci, ceux-ci
>>>> RP$     passif être, se passif
>>>> AUX$    avoir
>>>>
>>>>
>>>> VAL$    abaisser: P0 (P1)
>>>> VTYPE$  predicator simple
>>>> VERB$   ABAISSER/abaisser
>>>> NUM$    20
>>>> EG$     il a raconté cette anecdote pour m'abaisser
>>>> TR_DU$  vernederen, kleineren
>>>> TR_EN$  humiliate
>>>> FRAME$  subj:pron|n:[hum], ?obj:pron|n:[hum]
>>>> P0$     (que), qui, je, nous, elle, il, ils, on, (ça), celui-ci, ceux-ci,
>>>> (ça(de_inf))
>>>> P1$     0, qui, te, vous, la, le, les, se réc., en Q, celui-ci, ceux-ci,
>>>> l'un l'autre
>>>> RP$     passif être, se faire passif
>>>> AUX$    avoir
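
The paragraph-based DICOVALENCE format above could be read with a small key-value parser; a sketch under the assumption that entries are separated by blank lines, keys end in '$', and wrapped lines continue the previous value:

```python
# Hypothetical sketch: parsing DICOVALENCE-style paragraphs into key/value dicts.
# Assumes blank-line-separated entries and '$'-terminated keys; lines without a
# '$' key are treated as continuations of the previous value.

def parse_dicovalence(text):
    entries, entry, last_key = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            if entry:                       # blank line closes an entry
                entries.append(entry)
                entry, last_key = {}, None
            continue
        if "$" in line.split()[0]:          # a "KEY$  value" line
            key, _, value = line.partition("$")
            entry[key] = value.strip()
            last_key = key
        elif last_key:                      # wrapped continuation line
            entry[last_key] += " " + line.strip()
    if entry:
        entries.append(entry)
    return entries
```

Each entry then becomes one dict, e.g. with "VAL", "FRAME", "AUX" keys, which maps directly onto the key-value matching discussed below.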
>>>>
>>>>
>>>> I would also add a new list of valency attributes to every
>>>> AnalyzedToken, simply as a string value (parsing the string would be
>>>> overkill because there might be different ways of encoding valency
>>>> information for different languages), with appropriate getters and setters.
>>>>
>>>> To use this, we will need some additional attributes on two XML
>>>> elements: token and exception. The following notation seems fine:
>>>>
>>>> <token postag="verb" valence="WHATEVER_STRING"/>
>>>>
>>>> I think matching valence should, by default, use regular expressions.
>>>> I'm not sure if negation is needed.
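
Regex matching on the raw valency string, as proposed above, could be as simple as this sketch (function name and None-handling are assumptions, not existing LT behavior):

```python
import re

# Hypothetical sketch of matching the valence="..." attribute as a regular
# expression against a token's raw valency string (the unparsed dictionary
# value stored on the AnalyzedToken). Names here are illustrative only.

def valence_matches(valency_string, pattern):
    """True if the token has valency info and the regex matches anywhere in it."""
    if valency_string is None:              # token absent from the valency lexicon
        return False
    return re.search(pattern, valency_string) is not None
```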
>>>>
>>>> Alternatively, we could go for attribute-value pairs, but the problem
>>>> would be how to make this language-independent and use it in our XML
>>>> notation. The easiest way would probably be to define types and their
>>>> ids, just the way I did for unification. For French, this could look
>>>> like:
>>>>
>>>> <token>abaisser<valency id="AUX">avoir</valency><valency
>>>> id="VTYPE">predicator simple</valency></token>
>>>>
>>>> Of course, the textual values of <valency> could be made into regexes
>>>> and negated.
>>>>
>>>> For this, of course, we would need to write a key-value parser for
>>>> each particular valency dictionary. But this would definitely speed up
>>>> matching and make writing rules much easier.
>>>>
>>>> Best
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Site24x7 APM Insight: Get Deep Visibility into Application Performance
>>>> APM + Mobile APM + RUM: Monitor 3 App instances at just $35/Month
>>>> Monitor end-to-end web transactions and take corrective actions now
>>>> Troubleshoot faster and improve end-user experience. Signup Now!
>>>> http://pubads.g.doubleclick.net/gampad/clk?id=267308311&iu=/4140
>>>> _______________________________________________
>>>> Languagetool-devel mailing list
>>>> Languagetool-devel@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>>>
>>>
>>
>>
>
>


