On Wed, 13 Mar 2013 at 09:08 +0100, Mikel L. Forcada wrote:
> Hi all,
>
> I answer Fran's message here. I'll answer Jacob's in a minute.
>
> >> (1) Sliding-window part-of-speech tagger. The idea is to implement the
> >> unsupervised part-of-speech tagger
> >> (http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging)
> >> as a drop-in replacement for the current hidden-Markov-model tagger.
> >> Ideally, it should have support for unknown words, and also for "forbid"
> >> descriptions (not described in the paper). The tagger has a very intuitive
> >> interpretation (believe me, even if you find the maths a bit daunting). I
> >> am available for questions (I invented the tagger, I should be able to
> >> remember!).
> >
> > I think this would make a great project, we really need improved
> > morphological disambiguation in canonical Apertium. It's particularly
> > nice in that it can be represented as an FST (hopefully with
> > lttoolbox).
>
> It can, but one would have to think how to spit out the rules in a
> format which would be similar to the one lrx-proc uses now. Perhaps the
> FST processor in lrx-proc could be used for these kinds of rules too.

I think some of the code could definitely be reused, or at least give
some ideas.

> > I'm not sure if I quite understand the results in the paper though. The
> > performance of the tagger was better than bigram HMM trained with
> > Baum-Welch, but even then had a ~35% error rate?
>
> §5: "Figures show the average correct-tag rate only over ambiguous words
> (non-ambiguous words are not counted as successful disambiguations)."

So, to give an idea of the success rate in overall terms:

  1000 words
  300 ambiguous (from the paper)
  65% success rate on ambiguous words = 105 words mistagged (35% of 300)
  = 89.5% accuracy, or around 10.5 words in 100 mistagged.

Would that be about right? This compares to 88% for the Baum-Welch HMM
trained on the same corpus, which would be 12 words in 100 mistagged.

Would the sliding-window tagger be able to incorporate rules with
variable-length contexts, or would it be restricted to rules the size of
the window (e.g. -1/+1 = 3-grams)?

I've added the task as "Sliding-window part-of-speech tagger":

http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Sliding-window_part-of-speech_tagger

> >> (3) A preprocessor or compiler to avoid having to write structural
> >> transfer (i.e., .t1x, .t2x and .t3x) rules in raw XML, which is very
> >> overt and clear, but clumsy and hard to write. Before Apertium, in
> >> interNOSTRUM.com we had a language for .t1x-style files called
> >> MorphTrans, which is described in the paper
> >> http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/3355/1843 .
> >> I believe this language is much easier to write; it should be upgraded
> >> and documented. The preprocessor would read .mt1, .mt2, and .mt3 files
> >> in MorphTrans-style format (with keywords in English) and generate the
> >> current XML. There would also be the opposite tool (much easier to
> >> write as an XSLT stylesheet) to generate MorphTrans-style code from
> >> current XML code.
> >> MorphTrans can of course be redesigned a bit, and, in fact, it should.
> >
> > I love the "si ... altrament" :D
> >
> > And yes, this would be a fine project I think. One of the challenges
> > would be writing the validator though.
>
> Since one would have to write a proper compiler to generate XML from
> MorphTrans (e.g., using bison and flex as we did in interNOSTRUM), syntax
> errors would be reported by the compiler itself (as they were in
> interNOSTRUM), and would prevent further processing.

Great! I've added it.

> >> (3') The same for .dix files. Two roundtrip converters to use the old
> >> interNOSTRUM-style format
> >> (http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf), which is
> >> much easier to write.
> >
> > imho, this is basically reinventing lexc -- were there validators
> > available for it?
>
> It may be similar to lexc but I think our notation was easier and runs
> entirely parallel to .dix files now (.dix files are an XML rewriting,
> and slight extension, of the old interNOSTRUM format).

Added as "Plain-text formats for Apertium data" on the ideas page, and:

http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Plain-text_formats_for_Apertium_data

> >> (5) Extending the .dix language (and modifying lt-proc or writing a
> >> pre-processor to it) to be able to deal with the kind of stuff that
> >> some people miss in the .dix (and .metadix) formats and makes them use
> >> HFST, which means that people have to mix two different dictionary
> >> formats in the same language pair. And yes, of course, having
> >> something that translates the current HFST format to the new superdix
> >> format. Yes, you guessed, I'd love to throw HFST off board. I can
> >> tolerate it as a temporary heresy to keep the church of Apertium
> >> together, but, as co-pope [1], I'd like to canonicalize Apertium in
> >> the end. And it would be easier to deal with prefixes, hey Jonathan?
> >
> > Yes, this is a great idea too!
> > It's partly taken into account in:
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Flag_diacritics_in_lttoolbox
> >
> > "flag diacritics" is a bit of an odd term which basically means
> > "constraints which forbid/enforce certain non-adjacent morpheme
> > combinations".
> >
> > and also, partially in:
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Closer_integration_with_HFST
> >
> > Closer integration sounds a bit "ecumenical", but actually, the first
> > point is about coming up with a way of representing things like
> > archiphonemes in an lttoolbox-like fashion.
>
> I had forgotten about those previous efforts; all this info should be
> added to the ideas page. Everything seems to push in the same direction,
> and it would be a nice step toward c14n (canonicalization).

Yes, I'll get to it.

> > Feel free to edit these pages, adding your own ideas. Or we could just
> > add a new page.
> >
> >> (6) Tools to order .dixes and point at "bad coding style" (which would
> >> have to be defined). My collection is that the current .dix format is
> >> too powerful and allows almost anything. I have to think more about
> >> this idea, but I couldn't help throwing it out at you.
> >
> > We have the idea "lint for Apertium" which is quite similar to this
> > one.
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium
>
> This "lint" idea is more about detecting errors. I was talking rather
> about "style", maintainability, etc.

Do you think they could be merged into one idea? I think they might end
up sharing quite a bit of code...

Fran
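
For reference, the overall-accuracy estimate worked out in the message above can be checked with a few lines of Python. This is only a minimal sketch: it assumes, as the message does, that unambiguous words are never mistagged, and the function and variable names are illustrative rather than taken from the paper or from Apertium.

```python
# Minimal sketch: turn a success rate measured only over ambiguous words
# into an overall per-word accuracy. Names are hypothetical.

def overall_accuracy(total_words, ambiguous_words, ambiguous_success_rate):
    """Return (mistagged_words, overall_accuracy).

    Assumption: every tagging error falls on an ambiguous word, i.e.
    unambiguous words are always tagged correctly.
    """
    mistagged = round(ambiguous_words * (1.0 - ambiguous_success_rate))
    accuracy = (total_words - mistagged) / total_words
    return mistagged, accuracy


# Figures quoted in the message: 1000 words, 300 ambiguous, 65% success
# on the ambiguous ones.
mistagged, accuracy = overall_accuracy(1000, 300, 0.65)
print(f"{mistagged} mistagged, {accuracy:.1%} overall accuracy")
```

The same function can be run with other ambiguity rates to see how strongly the overall figure depends on the proportion of ambiguous words in the corpus.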
