Hi Andriy,

Well, the standard way to tag such words is to assign them a special POS
tag. For example, in the Polish tagset we use "burk" (an abbreviation of
"Burkina Faso": the word "Burkina" cannot function in Polish without
"Faso"). Then you'll know that this is a word in need of another one. I
don't think you should change tokenization here. Leaving it alone keeps
things simple, because ambiguous words are really hard cases: you would
have to adapt the tokenizer to produce several alternative tokenizations.
This is of course possible, but it would require replacing the list of
tokens with a complex graph...
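
To make the idea concrete, here is a minimal sketch in plain Java. It does
not use LanguageTool's real tagger API: the class name, the toy lexicon, the
partner table and the method orphanedDependents are all hypothetical, and
the word forms are only illustrative. It tags "бен" with "burk", leaves
tokenization untouched, and then checks that the next token has the lemma
"Ладен", so inflected forms are accepted without listing them as multiwords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DependentWordSketch {

    /** Tag for words that cannot stand alone (name taken from the Polish tagset). */
    static final String DEPENDENT_TAG = "burk";

    // Toy lexicon (assumption): surface form -> {lemma, POS tag}.
    static final Map<String, String[]> LEXICON = new HashMap<>();
    // Hypothetical table: dependent word -> lemma it must be followed by.
    static final Map<String, String> PARTNER_LEMMA = new HashMap<>();

    static {
        LEXICON.put("бен", new String[] {"бен", DEPENDENT_TAG});
        LEXICON.put("Ладен", new String[] {"Ладен", "noun"});
        LEXICON.put("Ладена", new String[] {"Ладен", "noun"});
        LEXICON.put("Ладеном", new String[] {"Ладен", "noun"});
        PARTNER_LEMMA.put("бен", "Laden".isEmpty() ? "" : "Ладен");
    }

    /** Returns the POS tag of a token, or null if the token is unknown. */
    static String tagOf(String token) {
        String[] entry = LEXICON.get(token);
        return entry == null ? null : entry[1];
    }

    /** Returns the lemma of a token, or null if the token is unknown. */
    static String lemmaOf(String token) {
        String[] entry = LEXICON.get(token);
        return entry == null ? null : entry[0];
    }

    /** Returns the indexes of dependent words not followed by their partner. */
    static List<Integer> orphanedDependents(List<String> tokens) {
        List<Integer> orphans = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (!DEPENDENT_TAG.equals(tagOf(token))) {
                continue; // ordinary word, nothing to check
            }
            String needed = PARTNER_LEMMA.get(token);
            String nextLemma = i + 1 < tokens.size() ? lemmaOf(tokens.get(i + 1)) : null;
            if (needed == null || !needed.equals(nextLemma)) {
                orphans.add(i);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // Dependent word followed by an inflected form of its partner.
        System.out.println(orphanedDependents(Arrays.asList("бен", "Ладена")));
        // Dependent word followed by an unrelated token.
        System.out.println(orphanedDependents(Arrays.asList("бен", "прийшов")));
    }
}
```

The token list stays a flat list here; the pairing is handled by a check
over the tags, which is the point of using a special tag instead of
changing the tokenizer.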

Regards,
Marcin


2014-03-20 18:05 GMT+01:00 Andriy Rysin <ary...@gmail.com>:

> Thanks Jaume, I'll try how it works a bit later (I've already added
> "бен" to dictionary so I could push my changes).
> One more question around this: if I don't want to have "бен" as a
> separate word, how can I mark "бен Ладен" (bin Laden) as a single
> word/noun? I use multiwords.txt to mark similar phrases, but in this
> example "Ладен" can be inflected, and I think multiwords does not support
> that (without writing all 7 forms of it in). Or is there a better
> approach to treat these not-really-a-word-by-itself situations?
>
> Thanks
> Andriy
>
> 2014-03-20 4:19 GMT-04:00 Jaume Ortolà i Font <jaumeort...@gmail.com>:
> > 2014-03-20 8:46 GMT+01:00 Daniel Naber <daniel.na...@languagetool.org>:
> >
> >> On 2014-03-19 22:10, Andriy Rysin wrote:
> >>
> >> > For example I have "Бен" (Ben) defined as man's name in the dictionary
> >> > but not "бен" (ben) so when "бен Ладен" (bin Laden) is found the "бен"
> >> > is tagged as name.
> >>
> >> Have you debugged this? It seems strange, as e.g. "Dog" in English won't
> >> be tagged, and the English tagger also extends BaseTagger so it should
> >> behave as the Ukrainian one. Or am I missing something?
> >>
> >
> >
> > Hi,
> >
> > The BaseTagger, by default, tags lowercase words with capitalized word
> > tags.
> > To change this, you can add "dontTagLowercaseWithUppercase();" to your
> > UkrainianTagger constructor. I have done it for you.
> >
> > Regards,
> > Jaume
> >
> >
> >
> > ------------------------------------------------------------------------------
> > Learn Graph Databases - Download FREE O'Reilly Book
> > "Graph Databases" is the definitive new guide to graph databases and their
> > applications. Written by three acclaimed leaders in the field,
> > this first edition is now available. Download your free book today!
> > http://p.sf.net/sfu/13534_NeoTech
> > _______________________________________________
> > Languagetool-devel mailing list
> > Languagetool-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/languagetool-devel
> >