On 2019-03-06 14:40, Antonio Toral wrote:
Dear apertiumers,

I would like to do morph segmentation for Kazakh and I've seen that
this is possible with Apertium [1].

However, in the example shown in that webpage the output doesn't seem
to be pure segmentation:

$ echo "щеткадағы" | hfst-proc kaz.segmenter
^щеткадағы/щетка>{D}{A}{G}{I}$

Is it possible to obtain segmentation instead? I.e.
щетка>дағы

Hi Antonio,

Thanks for your email! :D

You're right that it isn't pure segmentation. There is some good news
and some bad news.

The good news is that getting the 'pure' segmentation is definitely possible
and without too much effort.

Essentially the problem is that the way the
phonological rules are defined, some of them depend on 0 (empty) symbols
on the surface side of the string. The morpheme boundary currently always
goes to empty, so if we set it to not go to empty, then some of those
rules will break.

Fixing that means editing the rules to change the relevant contexts to ask for
0 aside from the morpheme boundary on the surface. This shouldn't take
too long.

The bad news is that it isn't done yet, but given that Kazakh is in WMT
this year, it is definitely something we are planning to implement,
hopefully in the next couple of days.
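
Until then, a rough stopgap might be to project the boundaries from the
current output back onto the surface form. The sketch below is not part of
Apertium or HFST; the function name project_boundaries is made up for
illustration. It assumes the hfst-proc line has the shape shown above
(^surface/analysis$), that ">" marks the morpheme boundary, and that every
other symbol, including each archiphoneme such as {D}, surfaces as exactly
one character. That last assumption fails whenever a symbol is realised as
0, in which case the function gives up and returns None.

```python
import re

# One token per symbol: a braced archiphoneme such as {D}, or any single character.
SYMBOL = re.compile(r"\{[^}]*\}|.", re.DOTALL)


def project_boundaries(line, boundary=">"):
    """Copy morpheme boundaries from the analysis onto the surface form.

    Hypothetical helper: returns None if the line is not of the form
    ^surface/analysis$ or if the symbols do not align one-to-one with
    the surface characters (e.g. when some symbol is realised as 0).
    """
    match = re.fullmatch(r"\^([^/]*)/(.*)\$", line.strip())
    if match is None:
        return None
    surface = match.group(1)
    # If several analyses are listed, only the first one is used here.
    analysis = match.group(2).split("/")[0]

    symbols = SYMBOL.findall(analysis)
    non_boundary = [s for s in symbols if s != boundary]
    if len(non_boundary) != len(surface):
        return None  # assumption violated: no one-to-one alignment

    pieces = []
    position = 0
    for symbol in symbols:
        if symbol == boundary:
            pieces.append(boundary)
        else:
            pieces.append(surface[position])
            position += 1
    return "".join(pieces)


if __name__ == "__main__":
    print(project_boundaries("^щеткадағы/щетка>{D}{A}{G}{I}$"))  # щетка>дағы
```

This only works for words where nothing is deleted or inserted on the
surface, so it is no substitute for fixing the rules as described above.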

Regards,

Fran




_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff
