On Tue, Jul 05, 2011 at 03:51:42PM +0000, Francis Tyers wrote:
> On Tue, 05 Jul 2011 at 17:42 +0200, Keld Jørn Simonsen wrote:
> > On Tue, Jul 05, 2011 at 03:24:09PM +0000, Francis Tyers wrote:
> > > On Tue, 05 Jul 2011 at 16:49 +0200, Mikel Forcada wrote:
> > > > Hi there,
> > > > > I would like to attach attributes to lemmas. Only a few, but maybe
> > > > > there could be more, so some way of introducing an attribute name
> > > > > would be nice, instead of having a predefined set of attribute
> > > > > names.
> > > > Lemmas as such aren't represented in Apertium dictionaries. They
> > > > are part of the lexical forms (one could say that the lemma is the
> > > > material from the beginning of the lexical form up to where the first
> > > > part-of-speech tag appears). For instance, for the surface form
> > > > "thought" an English dictionary would derive the lexical forms
> > > > "thought<n><sg>" and "think<vblex>...". The lemmas would then be
> > > > "thought" and "think". There is an attribute lm="...." in some
> > > > entries, but it is optional.
> > > > > I believe there are already lemma attributes, such as the word class
> > > > > of the lemma: noun, verb, adjective, adverb etc.
> > > > Not for lemmas. Lemma information is encoded as the content of
> > > > the element (see above). Part of speech as well as other morphological
> > > > information is encoded as attributes of the <s> (symbol element).
> > > > > What I have in mind is to attach data from WordNet, such as sense,
> > > > > hypernym, hyponym, holonym and meronym, and also combine it with the
> > > > > Swedish SALDO attributes of father and mother relations.
> > > > >
> > > > > The idea is then to choose a sense of a homonym based on the shortest
> > > > > distance to maybe the previous and following five words.
> > >
> > > Which language pair(s) are you working with? Is it really necessary?
> >
> > sv-da.
> > I cannot get further with my work without such features - at least
> > in a rudimentary form.
> >
> > I think it would be fun. Anyway, there are of course problems with
> > homonyms in Swedish and Danish that could be better solved with more
> > intelligent selection machinery.
> >
> > I have about 40000 new Swedish words that I have spent quite some time
> > on, and they should not damage the already existing work.
>
> In any case if you do mass addition you're probably going to damage what
> is there.
Why? Apart from making the collection bigger and thus slowing down the
system. If there are no conflicts then there is no harm - in a
mathematical sense.

> > > > > A lemma may have more than one sense. E.g. 'nut' may mean several
> > > > > things, such as the offspring of a plant, nuts and bolts, and
> > > > > testicles.
> > > > >
> > > > > Is this easy to do? How do I do it?
> > > > I think the attribute lm="...." could be stretched a bit to have any
> > > > value, which could be used to identify the lemma in another structure
> > > > which could contain all of these (for instance, giving an XPath to
> > > > another XML file containing all the desired information).
> > > >
> > > > Perhaps it would be better to have some kind of new general purpose
> > > > attribute that could be used to attach *standoff* information of this
> > > > kind to any entry <e>.
> > >
> > > I think that might be nice... also for, for example, verb valency or
> > > other features that we don't necessarily want to represent with tags.
> >
> > But maybe we can just do it with tags. Is it possible to add arbitrary
> > tags?
>
> Yes it is possible to add arbitrary tags.
>
> > > > Fran is working on lexical selection and I'm sure his opinion would be
> > > > interesting to read!
> > >
> > > Could also use the attribute 'c' for comment.
> >
> > I think it would be misleading to call it a comment. Maybe <a> -
> > attribute?
>
> Sure, but to start with you can use c="" and later if you have good
> results we can add a="" to the DTD.
>
> > > In the instance that a word has the same lemma/pos/gender and different
> > > paradigms/declensions, I use a pseudo lemma, for example from Russian:
> > >
> > > <e lm="????????"><i>????????</i><par n="????????__n_m_nn"/></e>
> > > <e lm="????????"><p><l>????????</l><r>????????:1</r></p><par
> > > n="??????__n_m_aa"/></e>
> >
> > I would then still need a way to disambiguate and choose the right one.
>
> Yep, but it's a tiny part of the problem.
> A more sensible place to start would be adding compounding support,
> like nn-nb and af-nl have.

I am at this time more interested in saving the several weeks of work I
have already done, and going forward with what I already have planned on
integrating SALDO data.

best regards
keld

_______________________________________________
Apertium-stuff mailing list
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
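
A minimal sketch of the convention Mikel describes in the quoted text, where the lemma is whatever precedes the first tag of a lexical form. The name split_lexical_form is hypothetical and not part of any Apertium tool. The sketch assumes tags are written as <...> and that the lemma itself contains no literal "<".

```python
import re

# Hypothetical helper, not an Apertium API: splits a lexical form such as
# "thought<n><sg>" into its lemma and its list of tag names.
TAG = re.compile(r"<([^<>]+)>")


def split_lexical_form(lexical_form):
    """Return (lemma, tags) for a lexical form like 'thought<n><sg>'."""
    first_tag = lexical_form.find("<")
    if first_tag == -1:
        # No tags at all: the whole string is treated as the lemma.
        return lexical_form, []
    lemma = lexical_form[:first_tag]
    tags = TAG.findall(lexical_form[first_tag:])
    return lemma, tags


if __name__ == "__main__":
    for form in ("thought<n><sg>", "think<vblex>"):
        print(split_lexical_form(form))
```

Keeping the lemma and the tags apart like this would make it possible to use the lemma as a key into a separate, standoff structure, which is roughly what stretching lm="...." or c="" amounts to.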

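
The idea of picking the sense with the shortest distance to the neighbouring words could look roughly like the sketch below. Everything in it is assumed for illustration: the toy relation graph, sense labels such as "nut:seed", and the function names. It does not use real WordNet or SALDO data, and the scoring (sum of breadth-first-search distances, with a fixed penalty for unreachable words) is only one possible choice.

```python
from collections import deque

# Toy undirected relation graph (hypothetical data standing in for
# hypernym/hyponym/holonym/meronym links from a real resource).
GRAPH = {
    "nut:seed": {"seed"},
    "seed": {"nut:seed", "plant"},
    "plant": {"seed", "tree"},
    "tree": {"plant"},
    "nut:fastener": {"fastener"},
    "fastener": {"nut:fastener", "bolt"},
    "bolt": {"fastener", "wrench"},
    "wrench": {"bolt"},
}


def distance(graph, start, goal):
    """Breadth-first search; returns the number of edges, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def choose_sense(graph, senses, context, window=5, penalty=10):
    """Pick the sense with the smallest summed distance to context words.

    `context` is (words_before, words_after); only the nearest `window`
    words on each side are considered.
    """
    before, after = context
    nearby = before[-window:] + after[:window]

    def cost(sense):
        total = 0
        for word in nearby:
            d = distance(graph, sense, word)
            # Words with no path to this sense cost a fixed penalty.
            total += penalty if d is None else d
        return total

    return min(senses, key=cost)
```

For example, with the context (["tighten", "the"], ["with", "a", "wrench"]) the only word connected to either sense is "wrench", which is three edges from "nut:fastener" and unreachable from "nut:seed", so the fastener sense is chosen.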