Re: [Apertium-stuff] GSoC19 - Unsupervised weighting of automata progress update

Tommi A Pirinen Thu, 25 Jul 2019 02:37:12 -0700

On Thu, Jul 25, 2019 at 12:05:05AM +0100, Francis Tyers wrote:
> On 2019-07-24 14:34, Amr Mohamed Hosny Anwar wrote:
> > On 7/22/19 9:32 PM, Francis Tyers wrote:
> >> 
> >> How about using BPE to weight the possible analyses?
> >> 
> >> e.g.
> >> 
> >> 1) BPE will give you a segmentation it likes for a word,
> >> "arabasız>lar>da"
> >> 
> >> 2) analyser will give you various segmentations:
> >>      araba>sız>lar>da, arabasız>lar>da
> >> 
> >> 3) you weight the segmentations that disagree with BPE higher, for each
> >>      boundary that isn't predicted by BPE
> >> 
> >> 
> >> F.
> >> 
> > Hi Francis,
> > 
> > I have checked the BPE segmentation paper.
> > 
> > The idea is easy to grasp but I think morphological analyzers' output
> > has a special format.
> > To use BPE I will need to drop the analysis tags such as "<n>" and "<sg>".
> > In order to validate that the BPE might be beneficial, I decided to
> > compute certain statistics from the tiny English corpus that I am using.
> > 
> > * Corpus size: 9098 tokens.
> > * Unambiguous tokens (unique analysis for the token): 5830 tokens (64%)
> > * Ambiguous tokens: 3268 tokens (36%)
> >      * Tokens having different segments: 533 tokens (5.86% of the corpus,
> >        16.3% of the ambiguous tokens)
> >      Example:
> >          * Surface token: oscillating
> >          * Analyses:
> > oscillating<adj>/oscillate<vblex><pprs>/oscillate<vblex><subs>/oscillate<vblex><ger>
> >          * Segments: oscillating/oscillate/oscillate/oscillate
> 
> Well, here the segments would be:
> 
> oscillating
> oscillat>ing
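
The weighting in steps 1-3 above could be sketched roughly as follows. This is only a minimal illustration, assuming weights act as costs (lower is better) and that each boundary BPE does not predict adds a fixed penalty. The function names and the `per_boundary` parameter are made up for the example and are not part of any Apertium or BPE tool.

```python
def boundaries(segmentation):
    """Return the character offsets in the unsegmented word where '>' occurs."""
    offsets, pos = set(), 0
    for piece in segmentation.split(">")[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def bpe_penalty(candidate, bpe, per_boundary=1.0):
    """Penalise every boundary in `candidate` that BPE did not predict."""
    extra = boundaries(candidate) - boundaries(bpe)
    return per_boundary * len(extra)


# Segmentation preferred by BPE (step 1) and the analyser's candidates (step 2).
bpe = "arabasız>lar>da"
candidates = ["araba>sız>lar>da", "arabasız>lar>da"]

# Step 3: one penalty unit per boundary that BPE does not predict.
weights = {c: bpe_penalty(c, bpe) for c in candidates}
for candidate, weight in weights.items():
    print(candidate, weight)
```

Here the candidate with the extra araba>sız boundary gets a higher weight than the one that matches BPE exactly.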


There are a few Apertium languages that can be tweaked to create
segmenters or segmenting labellers, probably all Turkic languages, for
example.

Another approach I have used to induce segments is to use the lexc
structure to inject segment boundaries. They will be only as accurate as
the lexc writer has been systematic, but this can be a bit more helpful
than assuming that 1 tag is 1 segmentation point. E.g.:

LEXICON Root

koira<n>:koir0 n__koir/a ;

LEXICON n__koir/a

a:a n__sgnom ;
a:a n__sg_ws ;
a:0 n__pl_ws ;

LEXICON n__sg_ws

<sg><ine>:ssa n__poss_clits ;
<sg><ela>:sta n__poss_clits ;

etc. 

You can autogenerate a segmenter that has koir>a>ssa and koir>i>sta and
so forth.
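
As a rough illustration of the idea (not the actual tooling I used), the sketch below walks a hand-copied version of the fragment above and puts a ">" wherever a continuation class is followed. The `LEXICONS` dictionary and the `segmentations` function are hypothetical. The plural and nominative branches are left out because their lexicons are not shown, and the possessive clitic lexicon is replaced by the end marker "#" for brevity.

```python
# Hypothetical in-memory copy of the lexc fragment:
# lexicon name -> list of (surface side, continuation class).
# "0" is the lexc epsilon and "#" marks the end of the word.
LEXICONS = {
    "Root": [("koir0", "n__koir/a")],
    "n__koir/a": [("a", "n__sg_ws")],
    "n__sg_ws": [("ssa", "#"), ("sta", "#")],  # n__poss_clits omitted here
}


def segmentations(lexicon="Root", pieces=()):
    """Yield surface forms with '>' at every lexicon transition."""
    if lexicon == "#":
        yield ">".join(pieces)
        return
    for lower, continuation in LEXICONS[lexicon]:
        surface = lower.replace("0", "")  # drop epsilons
        # Entries with an empty surface side do not create a segment.
        extended = pieces + (surface,) if surface else pieces
        yield from segmentations(continuation, extended)


results = list(segmentations())
print(results)
```

A real implementation would read the lexc file itself and would have to handle things like escaped zeros and flag diacritics, which this sketch ignores.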

You can actually do the same with Apertium monodix and pardefs, but
they are traditionally flatter.

> > I am encouraged to use BPE in this way and I believe it won't make a big
> > difference.
> > Do you think these statistics will differ for languages such as German,
> > Turkish, Finnish, which seem to have more complex compounding than English?
> > 
> 
> Yes, they could be very different. I think Tommi has had some experience
> in integrating Morfessor (a similar system to BPE) and transducers.
> 
> Tommi, could you let us know what you tried?

Most of our work back in the SMT days was to throw the segments at the
machine as if they were POS tags and hope for the best. I think the
DCU/Alicante team has built on it for the past few years, so checking their
WMT papers [1] can be informative. I know we were working on a BPE
implementation when I left.

I didn't try it, but it should be possible to connect the segments created
this way to specific tags to help weighting or voting and get better
scores. Linguistically, we know that every <ine> in Finnish will be ssa
or ssä and every <ela> will be sta or stä, so some relatively simple
counting might work. Perhaps this is already in some of the more advanced
Morfessor models like FlatCat.
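
To make the counting idea concrete, here is a minimal sketch, untested on real data. It assumes we already have (case tag, segmented form) pairs from somewhere. The toy `observations` list is invented for the example, and the code simply counts which segment co-occurs most often with each tag.

```python
from collections import Counter, defaultdict

# Invented toy data: (case tag, segmented surface form).
observations = [
    ("<ine>", "talo>ssa"),
    ("<ine>", "koira>ssa"),
    ("<ine>", "metsä>ssä"),
    ("<ela>", "talo>sta"),
    ("<ela>", "koira>sta"),
    ("<ela>", "metsä>stä"),
]

# Count how often each segment co-occurs with each tag.
cooccurrence = defaultdict(Counter)
for tag, segmented in observations:
    for segment in segmented.split(">"):
        cooccurrence[tag][segment] += 1

# Stems vary from word to word while case endings recur, so the most
# frequent segment per tag is a reasonable guess for that tag's ending.
best = {tag: counts.most_common(1)[0][0] for tag, counts in cooccurrence.items()}
print(best)
```

Such counts could then feed the weighting or voting, for example by preferring analyses whose tags match segments they usually co-occur with. Vowel harmony variants like ssa and ssä would need to be grouped separately.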

In terms of background reading, the literature combining unsupervised
learning and morphology is largely dominated by the segmentation idea, so
it should be a good source of ideas.

[1] <https://www.aclweb.org/anthology/sigs/sigmt/>
-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>, Universität
Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>. CLARIN-D
Entwickler.  President of ACL SIGUR SIG for Uralic languages
<http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.


