On 04/21/2013 03:11 AM, Jaume Ortolà i Font wrote:
> 2013/4/21 Andriy Rysin <ary...@gmail.com>
> > 1) I would like to treat several apostrophes equally (apostrophes are
> > part of the word in Ukrainian). E.g. in the dictionary and rules I
> > could use ' (0x27), but I would like to be able to parse text that
> > has U+2019 (and potentially U+02BC) the same way. I guess I could do
> > a simple replace in the word tokenizer, but I was wondering if
> > there's a better way.
>
> This is what is done in Catalan. So far I have found no problem.
This seems to work pretty nicely for *replacing* chars, but if I also
*remove* the combining accent (U+0301) from words in the word tokenizer,
it looks like that messes up the error position in the sentence (at
least in the web interface). Is there a right way to remove symbols I
don't care about?
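
To make the problem concrete: replacing one apostrophe with another keeps
the text length, but dropping U+0301 shifts every later character by one.
One possible approach is to record, for each character that is kept, where
it came from in the original text, and map match positions back before
reporting them. Below is a minimal sketch of that idea in plain Java.
NormalizedText, normalize and toOriginal are hypothetical names, not
LanguageTool API, and how this would be wired into the tokenizer or the
web interface is not shown.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of LanguageTool. */
final class NormalizedText {
    /** Text with apostrophes unified and combining accents removed. */
    final String text;
    /** origOffsets[i] is the index in the original of normalized char i;
     *  the last entry is the original length, so exclusive ends map too. */
    private final int[] origOffsets;

    private NormalizedText(String text, int[] origOffsets) {
        this.text = text;
        this.origOffsets = origOffsets;
    }

    static NormalizedText normalize(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        int[] map = new int[original.length() + 1];
        int kept = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '\u0301') {
                continue;              // drop the accent, record nothing
            }
            if (c == '\u2019' || c == '\u02BC') {
                c = '\'';              // length-preserving replacement
            }
            map[kept++] = i;           // remember where this char came from
            sb.append(c);
        }
        map[kept] = original.length(); // sentinel for end-of-text positions
        return new NormalizedText(sb.toString(), Arrays.copyOf(map, kept + 1));
    }

    /** Maps a position in the normalized text back to the original text. */
    int toOriginal(int normalizedPos) {
        return origOffsets[normalizedPos];
    }

    public static void main(String[] args) {
        // "slovo" with a stress mark on the first o, a space, then
        // "p'yat'" written with U+2019 (all given as Unicode escapes).
        String original =
            "\u0441\u043B\u043E\u0301\u0432\u043E \u043F\u2019\u044F\u0442\u044C";
        NormalizedText n = normalize(original);
        // Suppose a rule matched the second word at [6, 11) in n.text.
        int from = n.toOriginal(6);
        int to = n.toOriginal(11);
        System.out.println(from + ".." + to); // original span is [7, 12)
    }
}
```

Because an exclusive end position maps to the next kept character, an
accent removed at the very end of a match still falls inside the reported
span.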
Thanks
Andriy
_______________________________________________
Languagetool-devel mailing list
Languagetool-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/languagetool-devel