On Thu, Jul 25, 2019 at 12:05:05AM +0100, Francis Tyers wrote:
> El 2019-07-24 14:34, Amr Mohamed Hosny Anwar escribió:
> > On 7/22/19 9:32 PM, Francis Tyers wrote:
> >>
> >> How about using BPE to weight the possible analyses?
> >>
> >> e.g.
> >>
> >> 1) BPE will give you a segmentation it likes for a word,
> >> "arabasız>lar>da"
> >>
> >> 2) analyser will give you various segmentations:
> >> araba>sız>lar>da, arabasız>lar>da
> >>
> >> 3) you weight the segmentations that disagree with BPE higher for each
> >> boundary that isn't predicted by BPE
> >>
> >>
> >> F.
> >>
> > Hi Francis,
> >
> > I have checked the BPE segmentation paper.
> >
> > The idea is easy to grasp but I think morphological analyzers' output
> > has a special format.
> > To use BPE I will need to drop the analysis tags such as "<n>" and
> > "<sg>".
> > In order to validate that the BPE might be beneficial, I decided to
> > compute certain statistics from the tiny English corpus that I am
> > using.
> >
> > * Corpus size: 9098 tokens.
> > * Unambiguous tokens (unique analysis for the token): 5830 tokens (64%)
> > * Ambiguous tokens: 3268 tokens (36%)
> > * Tokens having different segments: 533 tokens (5.86% of the corpus
> >   - 16.3% out of the ambiguous tokens)
> > Example:
> > * Surface token: oscillating
> > * Analyses:
> > oscillating<adj>/oscillate<vblex><pprs>/oscillate<vblex><subs>/oscillate<vblex><ger>
> > * Segments: oscillating/oscillate/oscillate/oscillate
>
> Well, here the segments would be:
>
> oscillating
> oscillat>ing
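For reference, the weighting scheme quoted above (steps 1 to 3) could be sketched roughly as below. This is only a minimal illustration, not anything that exists in apertium: the function names and the flat per-boundary penalty are my assumptions, and I treat the weight as a cost (lower is better), as in a weighted transducer. It also assumes that the BPE segmentation and the analyser segmentations cover the same surface string, so that boundary offsets are comparable.

```python
def boundaries(segmentation, sep=">"):
    """Return the character offsets at which a segment boundary falls."""
    offsets = set()
    position = 0
    # The last segment ends the word, so it does not contribute a boundary.
    for segment in segmentation.split(sep)[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def disagreement_weight(candidate, bpe_segmentation, penalty=1.0):
    """Add `penalty` for each candidate boundary that BPE did not predict.

    `penalty` is a hypothetical tuning parameter, not a value from any tool.
    """
    extra = boundaries(candidate) - boundaries(bpe_segmentation)
    return penalty * len(extra)


def rank_analyses(candidates, bpe_segmentation):
    """Sort candidate segmentations from lowest to highest weight."""
    return sorted(
        candidates,
        key=lambda candidate: disagreement_weight(candidate, bpe_segmentation),
    )


# The example from the quoted message.
bpe = "arabasız>lar>da"
analyses = ["araba>sız>lar>da", "arabasız>lar>da"]
ranked = rank_analyses(analyses, bpe)
```

With the Turkish example, araba>sız>lar>da has one boundary (after "araba") that BPE does not predict, so it gets the higher weight and arabasız>lar>da is ranked first.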
There are a few Apertium languages that can be tweaked to create
segmenters or segmenting labellers, probably all Turkic languages, for
example. Another approach I have used to induce segments is to use the
lexc structure to inject segment boundaries. They will only be as accurate
as the lexc writer has been systematic, but this can be a bit more helpful
than assuming that 1 tag is 1 segmentation point. E.g.:

    LEXICON Root
    koira<n>:koir0 n__koir/a ;

    LEXICON n__koir/a
    a:a n__sgnom ;
    a:a n__sg_ws ;
    a:0 n__pl_ws ;

    LEXICON n__sg_ws
    <sg><ine>:ssa n__poss_clits ;
    <sg><ela>:sta n__poss_clits ;

etc. From this you can autogenerate a segmenter that has koir>a>ssa and
koir>i>sta and so forth. You can actually do the same with Apertium
monodix and pardefs, but they are traditionally flatter.

> > I am encouraged to use BPE in this way and I believe it won't make a
> > big difference.
> > Do you think these statistics will differ for languages such as German,
> > Turkish, Finnish who seem to have more complex compounding than English?
> >
>
> Yes, they could be very different. I think Tommi has had some experience
> in integrating Morfessor (a similar system to BPE) and transducers.
>
> Tommi, could you let us know what you tried?

Most of our work back in the SMT days was to throw the segments at the
machine as if they were POS tags and hope for the best. I think the
DCU/Alicante team has built on it over the past few years, so checking
their WMT papers[1] can be informative; I know we were working on a BPE
implementation when I left. I didn't try it, but it should be possible to
connect the segments created this way to specific tags to help with
weighting or voting and get better scores. For example, linguistically we
know that every <ine> in Finnish will be ssa or ssä and every <ela> will be
sta or stä, so some relatively simple counting might work. Perhaps this is
already done in some of the more advanced Morfessor models like FlatCat.
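To make the "relatively simple counting" concrete, here is a minimal sketch of what I have in mind. I have not run this on real data: the (tag, segment) pairs are invented toy examples, and the function names are made up for illustration. It only counts how often each tag co-occurs with each surface segment and turns that into a relative frequency that could feed weighting or voting.

```python
from collections import Counter, defaultdict


def count_tag_segments(pairs):
    """Count how often each analysis tag is aligned with each surface segment."""
    counts = defaultdict(Counter)
    for tag, segment in pairs:
        counts[tag][segment] += 1
    return counts


def segment_probability(counts, tag, segment):
    """Relative frequency of `segment` among all segments seen with `tag`."""
    total = sum(counts[tag].values())
    if total == 0:
        return 0.0  # tag never observed
    return counts[tag][segment] / total


# Toy, hand-made alignments (not from any corpus), following the Finnish
# example: <ine> surfaces as ssa/ssä, <ela> as sta/stä.
toy_pairs = [
    ("<ine>", "ssa"),
    ("<ine>", "ssä"),
    ("<ine>", "ssa"),
    ("<ela>", "sta"),
    ("<ela>", "stä"),
]
toy_counts = count_tag_segments(toy_pairs)
p_ine_ssa = segment_probability(toy_counts, "<ine>", "ssa")
```

A segmentation that pairs <ine> with a segment it has never been seen with would then get a probability of zero and could be weighted down or voted out, but how to get the tag-to-segment alignments in the first place is left open here.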
In terms of background reading, the combination of unsupervised methods
and morphology is quite dominated by the segmentation idea, so that
literature will be a good source of ideas.

[1] <https://www.aclweb.org/anthology/sigs/sigmt/>

-- 
Doktor Tommi A Pirinen, Computational Linguist,
<https://flammie.github.io/purplemonkeydishwasher/>,
Universität Hamburg, Hamburger Zentrum für Sprachkorpora <http://hzsk.de>.
CLARIN-D Entwickler.
President of ACL SIGUR SIG for Uralic languages <http://gtweb.uit.no/sigur/>.
I tend to follow inline-posting style in desktop e-mail messages.
_______________________________________________
Apertium-stuff mailing list
Apertium-stuff@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/apertium-stuff