Another good reference is UAX #29, Unicode Text Segmentation: http://unicode.org/reports/tr29/
Since the latest Lucene uses this as the basis of its text segmentation, it's worth getting familiar with it.

On Fri, Mar 30, 2012 at 10:09 AM, Robert Muir <rcm...@gmail.com> wrote:
> On Fri, Mar 30, 2012 at 1:03 PM, Denis Brodeur <denisbrod...@gmail.com> wrote:
>> Thanks Robert. That makes sense. Do you have a link handy where I can
>> find this information? i.e. word boundary/punctuation for any unicode
>> character set?
>>
>
> yeah, usually i use
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[\u0f10-\u0f19]&g=
>
> you can then click on a character and see all of its properties easily.
>
> (site seems to have some issues today)
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org