Michael, just out of curiosity: did you have a particular Analyzer in mind that you were planning to use, or certain Lucene features that you were concerned about working with these codepoints?
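For reference, here is a rough sketch of the char vs. code point distinction discussed in the quoted thread below. It uses only standard JDK methods (Java 1.5 and later) and is not Lucene code. The class and method names (SupplementaryDemo, keepLettersByChar, keepLettersByCodePoint) are made up for illustration, and U+10400 is just an arbitrary example of a code point above FFFF.

```java
final class SupplementaryDemo {
    // Hypothetical sample: "a", then U+10400 (an arbitrary code point above FFFF), then "b".
    static final String SAMPLE = "a" + new String(Character.toChars(0x10400)) + "b";

    // Char-based filtering: inspects each UTF-16 code unit on its own.
    // A lone surrogate is not a letter, so both halves of the pair are dropped.
    static String keepLettersByChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Code-point-based filtering: reads a surrogate pair as a single code point.
    static String keepLettersByCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Number of UTF-16 code units vs. number of code points in SAMPLE.
        System.out.println(SAMPLE.length());
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));
        // The char-based filter loses the supplementary character; the other keeps it.
        System.out.println(keepLettersByChar(SAMPLE).equals("ab"));
        System.out.println(keepLettersByCodePoint(SAMPLE).equals(SAMPLE));
    }
}
```

Because the supplementary character occupies two chars, anything that matches a single char at a time sees it as two units, which is why ?? works where a single ? does not.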

On Fri, Jul 31, 2009 at 12:19 PM, Simon Willnauer<[email protected]> wrote:
> Hey Robert, good to see that you found the link :)
>
> On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir<[email protected]> wrote:
>> Michael, as Simon mentioned I created an issue describing where you
>> might run into trouble, at least in lucene core.
>>
>> The low-level lucene stuff, it treats these just fine (as surrogate pairs).
>>
>> But most analyzers run into some trouble. (things like
>> WhitespaceAnalyzer are ok)
>>
>> Also wildcard queries and some things like that might not work as you
>> expect, for example ? operator will not match a codepoint > FFFF, but
>> of course you could use ?? as a workaround.
>>
>> On Fri, Jul 31, 2009 at 10:54 AM, Michael Thomsen<[email protected]>
>> wrote:
>>> Thanks for your quick response!
>>>
>>> Mike
>>>
>>> On Fri, Jul 31, 2009 at 10:25 AM, Simon
>>> Willnauer<[email protected]> wrote:
>>>> If I understand you correctly you are asking if lucene can deal with
>>>> encodings that use more than 16 bit. Well yes and no but mainly no.
>>>> The support for unicode 4.0 was introduced in Java 1.5 and lucene core
>>>> has still back-compat requirements for java 1.4. Lucene's analyzers
>>>> make use of char[] all over the place which is a sequence of UTF-16
>>>> code unit not a code point. As I said the support for codepoints was
>>>> introduced in 1.5 and I can remember that there is an issue which aims
>>>> to implement support for supplementary characters (those above FFFF).
>>>> Such a character is represented as 2 chars and the most of the
>>>> analysis code will simply remove those characters.
>>>> Have a look at this issue:
>>>> https://issues.apache.org/jira/browse/LUCENE-1689 ( @ Robert are you
>>>> working on this?)
>>>>
>>>> I'm sure there will be support for that in lucene 3.1.
>>>>
>>>> Simon
>>>> On Fri, Jul 31, 2009 at 4:08 PM, Michael Thomsen<[email protected]>
>>>> wrote:
>>>>> Is Lucene capable of handling UCS4 data natively?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mike
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>
>> --
>> Robert Muir
>> [email protected]
>

--
Robert Muir
[email protected]
