On 04/21/2013 03:11 AM, Jaume Ortolà i Font wrote:
2013/4/21 Andriy Rysin <ary...@gmail.com>

    1) I would like to treat several apostrophes equally (apostrophes are
    part of the word in Ukrainian). E.g., in the dictionary and rules I
    could use ' (0x27), but I would like to be able to parse text that has
    U+2019 (and potentially U+02BC) the same way. I guess I could do a
    simple replace in the word tokenizer, but I was wondering if there's a
    better way.

This is what is done in Catalan. So far I have found no problems.

This seems to work pretty well for *replacing* characters, but if I also *remove* the accent (U+0301) from words in the word tokenizer, it appears to throw off the error position in the sentence (at least in the web interface). Is there a proper way to remove symbols I don't care about?

Thanks
Andriy
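
A likely explanation for the difference: replacing one apostrophe with another keeps the string length unchanged, so every position in the normalized text still matches the original, while deleting U+0301 shifts every later character one position to the left. One possible workaround is to record, for each character that is kept, where it came from in the original text, and to map error positions back before reporting them. The sketch below only illustrates that idea in plain Java. It is not LanguageTool's API; the class OffsetPreservingNormalizer and its methods are hypothetical names invented for this example.

```java
import java.util.Arrays;

// Hypothetical helper, not part of LanguageTool: normalizes apostrophes,
// drops the combining acute accent, and remembers original positions.
public class OffsetPreservingNormalizer {

    /** Normalized text plus the original index of each normalized character. */
    public static final class Result {
        public final String text;
        // originalOffsets[i] is the index in the input of normalized char i;
        // the last entry is a sentinel equal to the input length.
        public final int[] originalOffsets;

        Result(String text, int[] originalOffsets) {
            this.text = text;
            this.originalOffsets = originalOffsets;
        }

        /** Maps a position in the normalized text back to the original text. */
        public int toOriginal(int normalizedPos) {
            return originalOffsets[normalizedPos];
        }
    }

    public static Result normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int[] offsets = new int[input.length() + 1];
        int n = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\u0301') {
                continue; // removed: later characters shift left by one
            }
            if (c == '\u2019' || c == '\u02BC') {
                c = '\''; // replaced 1:1, so the length is unchanged
            }
            offsets[n++] = i;
            out.append(c);
        }
        offsets[n] = input.length(); // sentinel for end-of-text positions
        return new Result(out.toString(), Arrays.copyOf(offsets, n + 1));
    }

    public static void main(String[] args) {
        // Ukrainian word for "five", written with U+2019 and a stress mark.
        String original = "\u043F\u2019\u044F\u0301\u0442\u044C";
        Result r = normalize(original);
        System.out.println(r.text);
        // Position of the 4th normalized character in the original text.
        System.out.println(r.toOriginal(3));
    }
}
```

With a mapping like this, an error reported at positions [from, to) in the normalized text would be shown at [toOriginal(from), toOriginal(to)) in the original text. Because the end position is mapped through the next kept character, a removed accent at the end of the flagged span stays inside the highlighted range.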