On Mon, 18 Jun 2012 at 06:47 +0200, Mikel Forcada wrote:
> Fran,
> thanks for the long message. You are right, patching and heterogeneity
> is a result of the lack of developer time. One of the sources of
> developer time we have is GSoC.

The problem there is that we don't have the mentor time for it.

> Also, as you know, we are applying for European projects: we could try
> to secure some development time there.

That would be an excellent idea!

> The issues we discussed should definitely be part of these two agendas.

Yes, definitely agree.

[snip]

> > The case where you can have det + adj + sent is where the adjective is
> > "nominalised".
> I think there is an easier explanation: the adjective is still an
> adjective but the noun is missing. Ellipsis. Don't mess with word
> categories!

In this example yes, I think we can call it ellipsis. But I'm not sure
that all examples could be analysed the same way. Although, if we
consider examples like:

  Enséñame el rojo / Enséñame la roja.
  ("Show me the red one", masculine / feminine)

it is obvious that the adjective is making reference to a given
object/noun.

> > So if you already have a noun, it is _probably_ better to
> > choose this.
> >
> > The other option would be to make a lexical rule for "sol". I can come
> > up with examples where this would be wrong (e.g. (?)"Hi ha més sols que
> > brillen aixina ? No és el sol.") but these are a bit far-fetched.
> No, they are OK. But I think we should do our best to distinguish
> "preferences" from "restrictions".

Agreed. But this stuff can be tested by looking at corpora. E.g. I took
the top 100478 lines from the Catalan Wikipedia corpus and translated
them with the normal tagger (from en-ca) and with the normal tagger + 1
CG rule:

  SELECT (n) IF (-1C (det def)) (0 (adj) OR (n)) (1 (sent)) (NOT 0 (n acr)) ;

A fairly simple rule which says: if a given word can be either a noun or
an adjective, and if the word to the left is unambiguously a definite
article and the word to the right is a <sent>, then choose the noun
reading, so long as it is not an acronym.

It seems that for "sol" it is a good rule, for "nou" it is definitely
not a good rule, and for other adjective/noun pairs it is OK.
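For readers who don't know CG, here is a minimal Python sketch of what the rule does, following the prose description above rather than the exact CG-3 matching semantics. The data layout (a sentence as a list of cohorts, each cohort a list of tag sets) and the names `has`, `all_have` and `select_noun` are made up for illustration; they are not part of vislcg3 or any Apertium tool.

```python
from typing import List, Set

Reading = Set[str]      # tags of one analysis, e.g. {"n", "m", "sg"}
Cohort = List[Reading]  # all analyses of one surface form


def has(cohort: Cohort, *tags: str) -> bool:
    """True if at least one reading carries all the given tags."""
    return any(set(tags) <= reading for reading in cohort)


def all_have(cohort: Cohort, *tags: str) -> bool:
    """True if every reading carries all the given tags (the 'C' in -1C)."""
    return bool(cohort) and all(set(tags) <= reading for reading in cohort)


def select_noun(sentence: List[Cohort]) -> List[Cohort]:
    """Keep only noun readings where the rule's context holds (hypothetical helper)."""
    out = [list(cohort) for cohort in sentence]
    for i in range(1, len(sentence) - 1):
        prev, cur, nxt = sentence[i - 1], sentence[i], sentence[i + 1]
        if (all_have(prev, "det", "def")               # -1C (det def)
                and has(cur, "n") and has(cur, "adj")  # noun/adjective ambiguity
                and has(nxt, "sent")                   # 1 (sent)
                and not has(cur, "n", "acr")):         # NOT 0 (n acr)
            out[i] = [r for r in cur if "n" in r]      # SELECT (n)
    return out


# Toy input for "el sol ." with invented tag sets.
el = [{"det", "def", "m", "sg"}]
sol = [{"n", "m", "sg"}, {"adj", "m", "sg"}]
stop = [{"sent"}]
result = select_noun([el, sol, stop])
```

With this toy input only the noun reading of "sol" survives. If "el" had a second, non-determiner reading, the -1C condition would fail and the cohort would be left untouched.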
http://pastebin.com/raw.php?i=zcp0F92M (save and use colordiff for
easier reading).

> I think the rule to choose "det noun
> sent" could delete some valid contexts: for instance "el dependiente"
> could be either "det noun" or "det adj" and has different meanings: "the
> shopkeeper" / "the dependent".

The question really is whether the rule would be wrong more times than
it is right, and whether it would produce more shocking translations or
less shocking ones.

> > Part of the problem is that there is no development to apertium-tagger,
> > bugs take a long time to find and fix, and no improvement work has been
> > done since 2009. On the other hand, CG3 has weekly commits by an active
> > developer, and bugs are fixed in days/hours instead of weeks/months.
> We should think about ways to secure development time for apertium-tagger.

True. But more than that, we should think about what development we want
for it. It doesn't take much to come up with ideas. But we currently
don't have any that I know of.

> > Improving our own tools just hasn't been a (research) objective of the
> > Apertium (research) community. -- The idea (as I understand it) has more
> > been "making the best of what we have".
> That's right. Paid developers are busy doing other things, so we need to
> find people and money. Not easy in 2012.

Indeed :(

> >> I am currently helping develop apertium-eng-kaz with three
> >> Kazakh students and the complexity shown by this module makes it harder
> >> than I thought to explain.
> > Which module ? The CG or the HFST ? Or both ?
> HFST. With %<things%> like this, and *two* dictionaries (twol, lexc).

There is only one dictionary, the .lexc file -- the .twol file contains
orthographical rules. The .lexc file is basically the same as lttoolbox
.dix, as described on this page:

http://wiki.apertium.org/wiki/Lttoolbox_and_lexc

If the idea of archiphonemes is confusing, then this is a problem of
grammars, not the format.
If the grammar says the plural form is -LAr, it is not particularly more
complicated to say that the form is -LAr in the .lexc. I'm not defending
lexc, it really is an ugly way of writing dictionaries, and has no
validation. But I don't think it is more complicated than lttoolbox to
explain.

> >> In the past, stubbornly sticking to some design tenets such as "vintage"
> >> 70's Unix-style pipelines and text formats has, in my opinion,
> >> contributed to having a lean, clear, homogeneous engine.
> > But in many cases non-homogeneous language pair data. The metadix format
> > for example. Explaining that is easily as tricky as explaining vislcg3.
> But much easier than reading HFST dictionaries (let alone explaining them!)
> > [snip]
> >
> >> One success of
> >> that is the development of multi-level transfer, with all its defects.
> >> That's why I will stubbornly defend canonicality!
> >>
> >> I hope you get the point.
> > Yes definitely. And I also defend canonicality, but at the same time I
> > want to offer the best and most productive tools to apertiumers.
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code
> >
> > * Rule-based finite-state disambiguation (currently Hrvoje is working on
> > it)
> Terrific. I'll stay tuned to this development.
> > * Flag diacritics in lttoolbox
> One day I will ask someone to explain to me what's there in HFST that is
> needed and that could not easily be dealt with in (a metadix format for)
> lttoolbox, because I still don't get it, at least when I look at Kazakh
> files.

Example of flag diacritics:

  verb stem + -iš- + [all non-finite forms] + [only p3.pl]

We want to be able to only allow third person plural forms following an
-iš- morpheme, but without duplicating all the final paradigms.
So after the -iš- we have an invisible symbol @enforce:onlyp3pl@ and
then after the non-p3 forms we have @reject:onlyp3pl@, so any paths
where you have -iš- + a personal form which isn't p3.pl are not printed
out in the final analysis.

There is another example here:

http://wiki.apertium.org/wiki/Development_ideas_for_dictionary_format

Fran
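The flag mechanism described above can be illustrated with a small Python sketch. It uses the informal @enforce:...@ / @reject:...@ notation from this message, which is not claimed to be the actual HFST flag-diacritic syntax; the function name `filter_path` and the tags `<ger>`, `<p1><sg>` and `<p3><pl>` are hypothetical and chosen only for the example.

```python
from typing import List, Optional

ENFORCE = "@enforce:"
REJECT = "@reject:"


def filter_path(symbols: List[str]) -> Optional[List[str]]:
    """Return the visible symbols of a path, or None if a flag rejects it."""
    active = set()   # flags switched on earlier in this path
    visible = []     # symbols that would appear in the analysis
    for sym in symbols:
        if sym.startswith(ENFORCE) and sym.endswith("@"):
            active.add(sym[len(ENFORCE):-1])
        elif sym.startswith(REJECT) and sym.endswith("@"):
            if sym[len(REJECT):-1] in active:
                return None  # e.g. -iš- followed by a non-p3.pl ending
        else:
            visible.append(sym)
    return visible


# -iš- followed by a p3.pl ending: the path survives.
ok = filter_path(["stem", "iš", "@enforce:onlyp3pl@", "<ger>", "<p3><pl>"])

# -iš- followed by a p1.sg ending: the reject flag kills the path.
bad = filter_path(["stem", "iš", "@enforce:onlyp3pl@", "<ger>",
                   "<p1><sg>", "@reject:onlyp3pl@"])

# No -iš-, so the reject flag has nothing to react to and the path survives.
plain = filter_path(["stem", "<ger>", "<p1><sg>", "@reject:onlyp3pl@"])
```

The point of the design is that the same final paradigm is shared by both the -iš- and the plain paths; the invisible symbols do the filtering, so nothing has to be duplicated.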
