Hi all,

I'm answering Fran's message here. I'll answer Jacob's in a minute.

>> (1) Sliding-window part-of-speech tagger. The idea is to implement the
>> unsupervised part-of-speech tagger
>> (http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging)
>> as a drop-in replacement for the current hidden-Markov-model tagger.
>> Ideally, it should have support for unknown words, and also for "forbid"
>> descriptions (not described in the paper). The tagger has a very intuitive
>> interpretation (believe me, even if you find the maths a bit daunting). I am
>> available for questions (I invented the tagger, I should be able to
>> remember!).
> I think this would make a great project, we really need improved
> morphological disambiguation in canonical Apertium. It's particularly
> nice in that it can be represented as an FST (hopefully with
> lttoolbox).

It can, but one would have to think about how to output the rules in a
format similar to the one lrx-proc uses now. Perhaps the FST processor
in lrx-proc could be used for these kinds of rules too.

>
> I'm not sure if I quite understand the results in the paper though. The
> performance of the tagger was better than bigram HMM trained with
> Baum-Welch, but even then had a ~35% error rate ?
§5: "Figures show the average correct-tag rate only over ambiguous words
(non-ambiguous words are not counted as successful disambiguations)."

>
> (3) A preprocessor or compiler to avoid having to write structural
> transfer (i.e., .t1x, .t2x and .t3x) rules in raw XML which is very
> overt and clear, but clumsy and hard to write. Before Apertium, in
> interNOSTRUM.com we had a language for .t1x-style files called
> MorphTrans, which is described in the
> paper
> http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/3355/1843
> . I believe this language is much easier to write; it should be upgraded and
> documented. The preprocessor would read .mt1, .mt2, and .mt3 files in
> MorphTrans-style format (with keywords in English) and generate the current
> XML. There would also be the opposite tool (much easier to write as an XSLT
> stylesheet) to generate MorphTrans-style code from current XML code.
> Morphtrans can of course be redesigned a bit, and, in fact, it should.
> I love the "si ... altrament" :D
>
> And yes, this would be a fine project I think. One of the challenges
> would be writing the validator though.

Since one would have to write a proper compiler to generate XML from
MorphTrans (e.g., using bison and flex, as we did in interNOSTRUM),
syntax errors would be reported by the compiler itself (as they were in
interNOSTRUM) and would prevent further processing.

>
>> (3') The same for .dix files. Two roundtrip converters to use the old
>> interNOSTRUM-style format
>> (http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf), which is
>> much easier to write.
> imho, this is basically reinventing lexc -- were there validators
> available for it ?

It may be similar to lexc, but I think our notation was easier, and it
runs entirely parallel to today's .dix files (.dix files are an XML
rewriting, and slight extension, of the old interNOSTRUM format).
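Coming back to the ~35% figure for a moment: since the rate is averaged only over ambiguous words, it says little about the error rate over running text. Here is a minimal sketch of the two ways of counting. The function name, the tag sequences and the share of ambiguous words are all invented for illustration; none of them come from the paper.

```python
def correct_tag_rates(gold, predicted, is_ambiguous):
    """Return (rate over ambiguous words only, rate over all words)."""
    if not (len(gold) == len(predicted) == len(is_ambiguous)):
        raise ValueError("all three sequences must have the same length")
    hits = [g == p for g, p in zip(gold, predicted)]
    ambiguous_hits = [h for h, a in zip(hits, is_ambiguous) if a]
    # A rate is undefined (None) when there is nothing to average over.
    rate_ambiguous = (
        sum(ambiguous_hits) / len(ambiguous_hits) if ambiguous_hits else None
    )
    rate_all = sum(hits) / len(hits) if hits else None
    return rate_ambiguous, rate_all


# Toy data (invented): 10 words, 3 of them ambiguous, 1 ambiguous word mistagged.
gold = ["det", "n", "vblex", "pr", "det", "n", "adj", "n", "cnjcoo", "vblex"]
predicted = ["det", "n", "vblex", "pr", "det", "n", "adj", "vblex", "cnjcoo", "vblex"]
is_ambiguous = [False, True, False, False, False, False, False, True, False, True]

rate_ambiguous, rate_all = correct_tag_rates(gold, predicted, is_ambiguous)
print(f"correct over ambiguous words: {rate_ambiguous:.1%}")
print(f"correct over all words:       {rate_all:.1%}")
```

In this toy case one mistagged word out of three ambiguous ones is an error rate of about 33% over ambiguous words, but only 10% over all ten words, because the seven unambiguous words are trivially right.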
>
>> (5) Extending the .dix language (and modifying lt-proc or writing a
>> pre-processor to it) to be able to deal with the kind of stuff that
>> some people miss in the .dix (and .metadix) formats and makes them use
>> HFST which means that people have to mix two different dictionary
>> formats in the same language pair. And yes, of course, having
>> something that translates the current HFST format to the new superdix
>> format. Yes, you guessed, I'd love to throw HFST off board. I can
>> tolerate it as a temporary heresy to keep the church of Apertium
>> together, but, as co-pope [1], I'd like to canonicalize Apertium in
>> the end. And it would be easier to deal with prefixes hey Jonathan?
> Yes, this is a great idea too! It's partly taken into account in:
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Flag_diacritics_in_lttoolbox
>
> "flag diacritics" is a bit of an odd term which basically means
> "constraints which forbid/enforce certain non-adjacent morpheme
> combinations".
>
> and also, partially in:
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Closer_integration_with_HFST
>
> Closer integration sounds a bit "ecumenical", but actually, the first
> point is about coming up with a way of representing things like
> archiphonemes in an lttoolbox-like fashion.

I had forgotten about those previous efforts; all this info should be
added to the ideas page. Everything seems to push in the same direction,
and it would be a nice step toward c14n (canonicalization).

>
> Feel free to edit these pages, adding your own ideas. Or we could just
> add a new page.
>
>> (6) Tools to order .dixes and point at "bad coding style" (which would
>> have to be defined). My collection is that the current .dix format is
>> too powerful and allows almost anything. I have to think more about
>> this idea, but I couldn't help throwing it out at you.
> We have the idea "lint for Apertium " which is quite similar to this
> one.
>
> http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium

This "lint" idea is more about detecting errors. I was talking rather
about "style", maintainability, etc.

Cheers

Mikel

-- 
Mikel L. Forcada (http://www.dlsi.ua.es/~mlf/)
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant
E-03071 Alacant, Spain
Phone: +34 96 590 9776
Fax: +34 96 590 9326
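To make the "style" idea in (6) a bit more concrete, here is a rough sketch of what such a checker might look like. The two rules it applies (a pardef that is never referenced, and section entries that are not sorted by lemma) are purely hypothetical examples, since what counts as "bad coding style" would still have to be defined; the function name and the sample dictionary are invented too.

```python
import xml.etree.ElementTree as ET


def style_warnings(dix_xml):
    """Return a list of style warnings for a .dix document given as a string."""
    root = ET.fromstring(dix_xml)
    warnings = []

    # Hypothetical rule 1: every <pardef> should be referenced by some <par>.
    used = {par.get("n") for par in root.iter("par")}
    for pardef in root.iter("pardef"):
        name = pardef.get("n")
        if name not in used:
            warnings.append(f"pardef '{name}' is never used")

    # Hypothetical rule 2: entries in a section should be sorted by lemma.
    for section in root.iter("section"):
        lemmas = [e.get("lm") for e in section.findall("e") if e.get("lm")]
        if lemmas != sorted(lemmas):
            warnings.append(
                f"section '{section.get('id')}': entries are not sorted by lm"
            )
    return warnings


# Invented sample dictionary, only large enough to trigger both rules.
SAMPLE = """<dictionary>
  <pardefs>
    <pardef n="house__n"><e><p><l/><r><s n="n"/><s n="sg"/></r></p></e></pardef>
    <pardef n="unused__n"><e><p><l/><r><s n="n"/></r></p></e></pardef>
  </pardefs>
  <section id="main" type="standard">
    <e lm="house"><i>house</i><par n="house__n"/></e>
    <e lm="cat"><i>cat</i><par n="house__n"/></e>
  </section>
</dictionary>"""

for warning in style_warnings(SAMPLE):
    print(warning)
```

Neither rule detects an error in the "lint" sense: the sample dictionary is perfectly usable as it stands, which is the difference between checking style and checking correctness.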
