Re: [Apertium-stuff] extra ideas for GSOC: getting the ball rolling

Francis Tyers Wed, 13 Mar 2013 03:16:25 -0700

On Wed, 13 Mar 2013 at 09:08 +0100, Mikel L. Forcada wrote:
> Hi all,
> I answer Fran's message here. I'll answer Jacob's in a minute.
> >> (1) Sliding-window part-of-speech tagger. The idea is to implement the
> >> unsupervised part-of-speech tagger
> >> (http://en.wikipedia.org/wiki/Sliding_window_based_part-of-speech_tagging) 
> >> as a drop-in replacement for the current hidden-Markov-model tagger. 
> >> Ideally, it should have support for unknown words, and also for "forbid" 
> >> descriptions (not described in the paper). The tagger has a very intuitive 
> >> interpretation (believe me, even if you find the maths a bit daunting). I 
> >> am available for questions (I invented the tagger, I should be able to 
> >> remember!).
> > I think this would make a great project, we really need improved
> > morphological disambiguation in canonical Apertium.  It's particularly
> > nice in that it can be represented as an FST (hopefully with
> > lttoolbox).
> It can, but one would have to think how to spit out the rules in a 
> format which would be similar to the one lrx-proc uses now. Perhaps the 
> FST processor in lrx-proc could be used for these kinds of rules too.


I think some of the code could definitely be reused, or at least give
some ideas.

> > I'm not sure if I quite understand the results in the paper though. The
> > performance of the tagger was better than bigram HMM trained with
> > Baum-Welch, but even then had a ~35% error rate ?
> 
> §5: "Figures show the average correct-tag rate only over ambiguous words 
> (non-ambiguous words are not counted as successful disambiguations)."

So, to give an idea of the success rate in overall terms:

1000 words
300 ambiguous (from the paper)
65% success rate on ambiguous words = 35% of 300 = 105 words mistagged

= 89.5% accuracy

or around 10.5 words in 100 mistagged.  Would that be about right ?

This compares to 88% for the Baum-Welch HMM trained on the same corpus,
which would be 12 words in 100 mistagged.  
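
The arithmetic above can be checked with a short sketch. It assumes, as the figures above do, that non-ambiguous words are never mistagged; the function names are made up for illustration, and the 60% figure for the HMM is only what the 88% would imply under that same assumption, not a number from the paper.

```python
from fractions import Fraction


def overall_accuracy(total_words, ambiguous_words, ambiguous_success_rate):
    """Return (overall accuracy, mistagged words), assuming only
    ambiguous words can be mistagged."""
    mistagged = ambiguous_words * (1 - ambiguous_success_rate)
    return 1 - mistagged / total_words, mistagged


def implied_ambiguous_rate(overall, total_words, ambiguous_words):
    """Inverse direction: success rate on ambiguous words implied by an
    overall accuracy, under the same assumption."""
    errors = (1 - overall) * total_words
    return 1 - errors / ambiguous_words


# Sliding-window tagger: 65% correct on the 300 ambiguous words in 1000.
accuracy, mistagged = overall_accuracy(1000, 300, Fraction(65, 100))
print(f"{float(mistagged):.0f} mistagged, {float(accuracy * 100):.1f}% overall")

# Baum-Welch HMM: 88% overall, converted back to the ambiguous-only rate.
hmm_rate = implied_ambiguous_rate(Fraction(88, 100), 1000, 300)
print(f"HMM: {float(hmm_rate * 100):.0f}% on ambiguous words")
```

Exact fractions are used instead of floats so that 35% of 300 comes out as exactly 105.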

Would the sliding-window tagger be able to incorporate rules with
variable-length contexts, or would it be restricted to rules the size of
the window ? (e.g. a -1/+1 window, i.e. 3-grams.)
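
To make the -1/+1 case concrete, here is a minimal sketch of how a fixed window yields one (left context, word, right context) triple per token. It is not taken from the paper or from any Apertium code; the `windows` helper, the `<s>` padding symbol and the ambiguity-class labels are all made up for illustration.

```python
def windows(tokens, left=1, right=1, pad="<s>"):
    """Yield (left_context, token, right_context) for every token,
    padding the sentence edges so each token has a full window."""
    tokens = list(tokens)
    padded = [pad] * left + tokens + [pad] * right
    for i in range(left, left + len(tokens)):
        yield (tuple(padded[i - left:i]),
               padded[i],
               tuple(padded[i + 1:i + 1 + right]))


# Hypothetical ambiguity classes for a three-word sentence.
sentence = ["det", "noun|verb", "adj"]
for context in windows(sentence):
    print(context)
```

With left=1 and right=1 every triple spans exactly three positions, so any rule read off such triples is tied to that size; variable-length contexts would need different `left` and `right` values per rule.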

I've added the task as: "Sliding-window part-of-speech tagger"

http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Sliding-window_part-of-speech_tagger

> >> (3) A preprocessor or compiler to avoid having to write structural
> >> transfer (i.e., .t1x, .t2x and .t3x) rules in raw XML which is very
> >> overt and clear, but clumsy and hard to write. Before Apertium, in
> >> interNOSTRUM.com we had a language for .t1x-style files called
> >> MorphTrans, which is described in the paper
> >> http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/download/3355/1843
> >> . I believe this language is much easier to write; it should be
> >> upgraded and documented. The preprocessor would read .mt1, .mt2, and
> >> .mt3 files in MorphTrans-style format (with keywords in English) and
> >> generate the current XML. There would also be the opposite tool (much
> >> easier to write as an XSLT stylesheet) to generate MorphTrans-style
> >> code from current XML code. MorphTrans can of course be redesigned a
> >> bit, and, in fact, it should.
> > I love the "si ... altrament" :D
> >
> > And yes, this would be a fine project I think. One of the challenges
> > would be writing the validator though.
> Since one would have to write a proper compiler to generate XML from 
> MorphTrans (e.g., using bison and flex as we did in interNOSTRUM), syntax 
> errors would be reported by the compiler itself (as they were in 
> interNOSTRUM), and would prevent further processing.

Great! I've added it.

> >> (3') The same for .dix files. Two roundtrip converters to use the old
> >> interNOSTRUM-style format
> >> (http://www.sepln.org/revistaSEPLN/revista/25/25-Pag93.pdf), which is
> >> much easier to write.
> > imho, this is basically reinventing lexc -- were there validators
> > available for it ?
> It may be similar to lexc but I think our notation was easier and runs 
> entirely parallel to .dix files now (.dix files are an XML rewriting 
> —and slight extension— of the old interNOSTRUM format).

Added as "Plain-text formats for Apertium data"

on the ideas page, and:

http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Plain-text_formats_for_Apertium_data

> >> (5) Extending the .dix language (and modifying lt-proc or writing a
> >> pre-processor to it) to be able to deal with the kind of stuff that
> >> some people miss in the .dix (and .metadix) formats and makes them use
> >> HFST which means that people have to mix two different dictionary
> >> formats in the same language pair. And yes, of course, having
> >> something that translates the current HFST format to the new superdix
> >> format. Yes, you guessed, I'd love to throw HFST off board. I can
> >> tolerate it as a temporary heresy to keep the church of Apertium
> >> together, but, as co-pope [1], I'd like to canonicalize Apertium in
> >> the end. And it would be easier to deal with prefixes hey Jonathan?
> > Yes, this is a great idea too! It's partly taken into account in:
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Flag_diacritics_in_lttoolbox
> >
> > "flag diacritics" is a bit of an odd term which basically means
> > "constraints which forbid/enforce certain non-adjacent morpheme
> > combinations".
> >
> > and also, partially in:
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/Closer_integration_with_HFST
> >
> > Closer integration sounds a bit "ecumenical", but actually, the first
> > point is about coming up with a way of representing things like
> > archiphonemes in an lttoolbox-like fashion.
> I had forgotten about those previous efforts; all this info should be 
> added to the ideas page. Everything seems to push in the same direction, 
> and it would be a nice step toward c14n (canonicalization).

Yes, I'll get to it.

> > Feel free to edit these pages, adding your own ideas. Or we could just
> > add a new page.
> >
> >> (6) Tools to order .dixes and point at "bad coding style" (which would
> >> have to be defined). My collection is that the current .dix format is
> >> too powerful and allows almost anything. I have to think more about
> >> this idea, but I couldn't help throwing it out at you.
> > We have the idea "lint for Apertium" which is quite similar to this
> > one.
> >
> > http://wiki.apertium.org/wiki/Ideas_for_Google_Summer_of_Code/lint_for_Apertium
> This "lint" idea is more about detecting errors. I was talking rather 
> about "style", maintainability, etc.

Do you think they could be merged into one idea ? I think they might end
up sharing quite a bit of code...

Fran

