@Michael: add yourself as a Watcher for the issue. @Robert: I can start working on this within the next few weeks - can you help too?
simon

On Fri, Jul 31, 2009 at 7:49 PM, Robert Muir<[email protected]> wrote:
> Michael, makes sense. Most of the issues probably have some
> workaround, so reply back if you need.
>
> Thanks for your feedback though, it is helpful to know that it's important!
>
> On Fri, Jul 31, 2009 at 1:36 PM, Michael Thomsen<[email protected]>
> wrote:
>> Not really. At this point, I just needed to know where the UCS4
>> support stands. I'm reasonably familiar with the various analyzers and
>> what they can do. It's just the state of UCS4 support that might be an
>> issue for us.
>>
>> Thanks,
>>
>> Mike
>>
>> On Fri, Jul 31, 2009 at 12:25 PM, Robert Muir<[email protected]> wrote:
>>> Michael, just out of curiosity, did you have a particular Analyzer in
>>> mind you were planning on using, or rather certain features in Lucene
>>> you were concerned would work with these codepoints?
>>>
>>> On Fri, Jul 31, 2009 at 12:19 PM, Simon
>>> Willnauer<[email protected]> wrote:
>>>> Hey Robert, good to see that you found the link :)
>>>>
>>>> On Fri, Jul 31, 2009 at 6:06 PM, Robert Muir<[email protected]> wrote:
>>>>> Michael, as Simon mentioned I created an issue describing where you
>>>>> might run into trouble, at least in lucene core.
>>>>>
>>>>> The low-level lucene stuff treats these just fine (as surrogate
>>>>> pairs).
>>>>>
>>>>> But most analyzers run into some trouble. (things like
>>>>> WhitespaceAnalyzer are ok)
>>>>>
>>>>> Also wildcard queries and some things like that might not work as you
>>>>> expect, for example the ? operator will not match a codepoint > FFFF, but
>>>>> of course you could use ?? as a workaround.
>>>>>
>>>>> On Fri, Jul 31, 2009 at 10:54 AM, Michael Thomsen<[email protected]>
>>>>> wrote:
>>>>>> Thanks for your quick response!
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> On Fri, Jul 31, 2009 at 10:25 AM, Simon
>>>>>> Willnauer<[email protected]> wrote:
>>>>>>> If I understand you correctly you are asking if lucene can deal with
>>>>>>> encodings that use more than 16 bits. Well yes and no but mainly no.
>>>>>>> The support for unicode 4.0 was introduced in Java 1.5 and lucene core
>>>>>>> still has back-compat requirements for java 1.4. Lucene's analyzers
>>>>>>> make use of char[] all over the place, which is a sequence of UTF-16
>>>>>>> code units, not code points. As I said the support for codepoints was
>>>>>>> introduced in 1.5 and I can remember that there is an issue which aims
>>>>>>> to implement support for supplementary characters (those above FFFF).
>>>>>>> Such a character is represented as 2 chars and most of the
>>>>>>> analysis code will simply remove those characters.
>>>>>>> Have a look at this issue:
>>>>>>> https://issues.apache.org/jira/browse/LUCENE-1689 (@Robert, are you
>>>>>>> working on this?)
>>>>>>>
>>>>>>> I'm sure there will be support for that in lucene 3.1.
>>>>>>>
>>>>>>> Simon
>>>>>>> On Fri, Jul 31, 2009 at 4:08 PM, Michael
>>>>>>> Thomsen<[email protected]> wrote:
>>>>>>>> Is Lucene capable of handling UCS4 data natively?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
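
For anyone reading this thread later: the difference between UTF-16 code units (char) and code points that Simon and Robert describe can be seen with plain JDK calls. The following is only a minimal sketch, assuming Java 1.5 or later. The class name SurrogateDemo, the helper countCodePoints, and the sample character U+1D11E are illustrative choices and do not come from the thread or from Lucene.

```java
public class SurrogateDemo {

    // Counts code points by walking the string one code point at a time,
    // so a surrogate pair is treated as a single character.
    // (Hypothetical helper for illustration; String.codePointCount does the same.)
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp); // 1 for BMP characters, 2 for supplementary ones
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above U+FFFF,
        // so UTF-16 represents it as a surrogate pair of two chars.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println("chars:       " + clef.length());
        System.out.println("code points: " + countCodePoints(clef));
        System.out.println("high surrogate first: " + Character.isHighSurrogate(clef.charAt(0)));
        System.out.println("low surrogate second: " + Character.isLowSurrogate(clef.charAt(1)));
    }
}
```

Because the string holds two chars for one code point, any code that works char by char sees two units. That is consistent with Robert's remark that ?? rather than ? is needed in a wildcard query for such a character.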
