Fran,
thanks for the long message. You are right: the patching and the
heterogeneity are a result of the lack of developer time. One of the
sources of developer time we have is GSoC. Also, as you know, we are
applying for European projects: we could try to secure some development
time there. The issues we discussed should definitely be part of these
two agendas.

[Snip, on TSX rules]

> True, but the rules are fairly restrictive, allowing only bigram
> contexts.
True.
> In Atro Voutilainen's "Hand Crafted Rules", he gives numbers
> saying that in the (famous) EngCG tagger, 10% of rules have unbounded
> contexts, and 21% have a condition that is not a neighbouring word. This
> may seem like a low number, but these are exactly the kind of problems
> (non-neighbouring words) that we are up against, and that cannot be
> taken care of with bigram rules.
>
> $ echo "He very rarely looks that way." | apertium -d . en-es-tagger
> ^Prpers<prn><subj><p3><m><sg>$ ^very<preadv>$ ^rarely<adv>$
> ^look<vblex><pri><p3><sg>$ ^that<cnjsub>$ ^way<n><sg>$^.<sent>$^.<sent>$
>
> In principle the finite verb + cnjsub and cnjsub + noun and noun + sent
> readings are fine. The problem is that the cnjsub + noun + sent is
> problematic.
Good example.
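
To make the point concrete, here is a minimal Python sketch (not Apertium code) of why a bigram-only check cannot see this problem. The tag sequence is simplified from the output quoted above to one tag per word, with the duplicated <sent> dropped; the forbid lists and the function names are hypothetical and are not taken from any real TSX file.

```python
# Simplified tag sequence for "He very rarely looks that way."
TAGS = ["prn", "preadv", "adv", "vblex", "cnjsub", "n", "sent"]

# Hypothetical forbid lists, for illustration only.
FORBID_BIGRAMS = set()  # each adjacent pair is acceptable on its own
FORBID_TRIGRAMS = {("cnjsub", "n", "sent")}  # the problematic wider context


def ngrams(tags, n):
    """Return every window of n consecutive tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def violations(tags, forbidden, n):
    """Return the n-grams of the sequence that appear in the forbidden set."""
    return [gram for gram in ngrams(tags, n) if gram in forbidden]


print(violations(TAGS, FORBID_BIGRAMS, 2))   # nothing to object to pairwise
print(violations(TAGS, FORBID_TRIGRAMS, 3))  # the three-tag window is caught
```

The bad reading only becomes visible once the window is three tags wide, which is exactly what the bigram restriction rules out.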
>
>>> Might help us get around tagging errors like:
>>>
>>> $ echo "Avui no veig el sol." | apertium -d . ca-en-tagger
>>> ^Avui<adv>$ ^no<adv>$ ^veure<vblex><pri><p1><sg>$ ^el<det><def><m><sg>$
>>> ^sol<adj><m><sg>$^.<sent>$^.<sent>$
>> Fran, what would be a reasonable "forbidding" rule here that repairs
>> this error but does not break things somewhere else?
> I would write a rule to say:
>
> If there is an ambiguity between the sequence "definite article +
> adjective/noun + sentence boundary" choose the "det noun sent" reading.
>
> The case where you can have det + adj + sent is where the adjective is
> "nominalised".
I think there is an easier explanation: the adjective is still an 
adjective but the noun is missing. Ellipsis. Don't mess with word 
categories!
> So if you already have a noun, it is _probably_ better to
> choose this.
>
> The other option would be to make a lexical rule for "sol". I can come
> up with examples where this would be wrong (e.g. (?)"Hi ha més sols que
> brillen aixina ? No és el sol.") but these are a bit rebuscado (far-fetched).
No, they are OK. But I think we should do our best to distinguish 
"preferences" from "restrictions". I think the rule to choose "det noun 
sent" could delete some valid readings: for instance "el dependiente" 
could be either "det noun" or "det adj", with different meanings: "the 
shopkeeper" / "the dependent".
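
The distinction could be sketched as follows. This is a toy Python illustration under my own assumptions, not how apertium-tagger or CG3 works: the data structures, the weight and the function names are all hypothetical. A restriction deletes a reading; a preference only reorders the readings and leaves the other one available.

```python
from itertools import product

# Hypothetical ambiguity classes: each word with its candidate tags.
SENTENCE = [("el", ["det"]), ("sol", ["n", "adj"]), (".", ["sent"])]


def readings(sentence):
    """Enumerate every tag sequence allowed by the ambiguity classes."""
    return [tuple(tags) for tags in product(*(t for _, t in sentence))]


def contains(seq, pattern):
    """True if pattern occurs as a contiguous run inside seq."""
    n = len(pattern)
    return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


def rank(sentence, forbid, prefer):
    """Drop forbidden readings, then order the rest by preference score."""
    kept = [r for r in readings(sentence)
            if not any(contains(r, f) for f in forbid)]

    def score(reading):
        return sum(w for p, w in prefer.items() if contains(reading, p))

    return sorted(kept, key=score, reverse=True)


# Preference: "det n sent" comes first, "det adj sent" survives as second choice.
print(rank(SENTENCE, set(), {("det", "n", "sent"): 1.0}))
# Restriction: the elliptical "det adj sent" reading is gone for good.
print(rank(SENTENCE, {("det", "adj", "sent")}, {}))
```

With the preference, a later module can still recover "det adj" for cases like "el dependiente"; with the restriction it cannot.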
>
> In the Catalan Wikipedia corpus, the only examples of "el sol ." are of
> the Sun.
>
> If you can find a corpus where "el sol ." as "the only one ." exceeds
> "the Sun ." I would be interested to see it :)
You're grand there with "sol", but other cases may be different.

[snip]

> Part of the problem is that there is no development to apertium-tagger,
> bugs take a long time to find and fix, and no improvement work has been
> done since 2009. On the other hand, CG3 has weekly commits by an active
> developer, and bugs are fixed in days/hours instead of weeks/months.
We should think about ways to secure development time for apertium-tagger.
>
>
>
> Improving our own tools just hasn't been a (research) objective of the
> Apertium (research) community. -- The idea (as I understand it) has more
> been "making the best of what we have".
That's right. Paid developers are busy doing other things, so we need to 
find people and money. Not easy in 2012.

[Snip]
>
>
> Furthermore, more discouraging for potential developers would be having
> to write a morphological dictionary for their language without the
> appropriate tools.
True.
>
>> I am currently helping develop apertium-eng-kaz with three
>> Kazakh students and the complexity shown by this module makes it harder
>> than I thought to explain.
> Which module ? The CG or the HFST ? Or both ?
HFST. With %<things%> like this, and *two* dictionaries (twol, lexc).
>
>> In the past, stubbornly sticking to some design tenets such as "vintage"
>> 70's Unix-style pipelines and text formats has, in my opinion,
>> contributed to having a lean, clear, homogeneous engine.
> But in many cases non-homogeneous language pair data. The metadix format
> for example. Explaining that is easily as tricky as explaining vislcg3.
But much easier than reading HFST dictionaries (let alone explaining them!)

[snip]
>
>>   One success of
>> that is the development of multi-level transfer, with all its defects.
>> That's why I will stubbornly defend canonicality!
>>
>> I hope you get the point.
> Yes definitely. And I also defend canonicality, but at the same time I
> want to offer the best and most productive tools to apertiumers.
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code
>
> * Rule-based finite-state disambiguation (currently Hrvoje is working on
> it)
Terrific. I'll stay tuned to this development.
> * Flag diacritics in lttoolbox
One day I will ask someone to explain to me what's there in HFST that is 
needed and that could not easily be dealt with in (a metadix format for) 
lttoolbox, because I still don't get it, at least when I look at Kazakh 
files.

Cheers

Mikel

-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326

