I opened https://issues.apache.org/jira/browse/LUCENE-2074
It fixes the problem: the patch uses a different impl depending on matchVersion. If I commit it now, I would regenerate the rc1 artifacts and release them to java-user tomorrow. Currently the ones on people.apache.org are only "known" to java-dev users.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Monday, November 16, 2009 9:59 PM
> To: java-dev@lucene.apache.org
> Subject: RE: Why release 3.0?
>
> OK, I checked. The JFLEX file in trunk was generated with 1.4. I regenerated
> it with 1.5 and it was different (completely!). I saved the old version and
> renamed it to StandardTokenizerImplJava14 extends StandardTokenizerImpl.
>
> By this the impl is exchanged depending on version. The 1.4 version can no
> longer be regenerated because it has no .jflex file, and it should really
> never be regenerated.
>
> > -----Original Message-----
> > From: Mark Miller [mailto:markrmil...@gmail.com]
> > Sent: Monday, November 16, 2009 9:45 PM
> > To: java-dev@lucene.apache.org
> > Subject: Re: Why release 3.0?
> >
> > I still recommend we add a file then, HowToRegenJflex.txt or something,
> > that specifically says to use 1.5 or 1.6. I don't think changing the
> > current notice/warning is visible enough to ensure someone doesn't
> > break this.
> >
> > Robert Muir wrote:
> > > no. its still 4.0, but i hear 1.7 will be 5.1 or 5.2
> > >
> > > the only way to truly control this, would be to use something like ICU
> > > to control the unicode version being used (and actually be faster, and
> > > support higher version).
> > > see http://site.icu-project.org/home/why-use-icu4j
> > >
> > > the issue is that lucene does not have 3rd party library dependencies,
> > > on the other hand, i think tika and/or nutch already incorporate icu
> > > for charset detection.
> > >
> > > i won't argue for this really, i know nobody wants it, but you can see
> > > how the situation of not being able to control unicode semantics is
> > > really difficult for a search engine.
> > >
> > > On Mon, Nov 16, 2009 at 3:33 PM, Uwe Schindler <uschind...@pangaea.de> wrote:
> > >
> > >     Did 1.6 change the unicode version? Robert?
> > >
> > >     -----
> > >     UWE SCHINDLER
> > >     Webserver/Middleware Development
> > >     PANGAEA - Publishing Network for Geoscientific and Environmental Data
> > >     MARUM - University of Bremen
> > >     Room 2500, Leobener Str., D-28359 Bremen
> > >     Tel.: +49 421 218 65595
> > >     Fax: +49 421 218 65505
> > >     http://www.pangaea.de/
> > >     E-mail: uschind...@pangaea.de
> > >
> > > > -----Original Message-----
> > > > From: Mark Miller [mailto:markrmil...@gmail.com]
> > > > Sent: Monday, November 16, 2009 9:30 PM
> > > > To: java-dev@lucene.apache.org
> > > > Subject: Re: Why release 3.0?
> > > >
> > > > And what happens when someone regenerates it with 1.6 without knowing?
> > > >
> > > > Uwe Schindler wrote:
> > > > > I check this by generating the file with 1.4 and 1.5. The 1.4 version
> > > > > will not change anymore, so we just leave the java file, no jflex
> > > > > anymore. The old one is used for Lucene until 2.9; if you use
> > > > > matchVersion=LUCENE_30, the new one is used, which can also be
> > > > > regenerated.
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Mark Miller [mailto:markrmil...@gmail.com]
> > > > >> Sent: Monday, November 16, 2009 9:21 PM
> > > > >> To: java-dev@lucene.apache.org
> > > > >> Subject: Re: Why release 3.0?
> > > > >>
> > > > >> Good point - and that likely means the current warning is not working -
> > > > >> what can we do to improve it?
> > > > >>
> > > > >> Perhaps a new text file called jflexregen or something, and it
> > > > >> specifically says you must use java 1.5?
> > > > >>
> > > > >> Uwe Schindler wrote:
> > > > >>
> > > > >>> I think the regenerated code in Standard has not been generated with
> > > > >>> 1.4 for years :-) Most developers use 1.5 or even 1.6. So it has
> > > > >>> already changed incompatibly.
> > > > >>>
> > > > >>> ------------------------------------------------------------------------
> > > > >>>
> > > > >>> *From:* Robert Muir [mailto:rcm...@gmail.com]
> > > > >>> *Sent:* Monday, November 16, 2009 8:52 PM
> > > > >>> *To:* java-dev@lucene.apache.org
> > > > >>> *Subject:* Re: Why release 3.0?
> > > > >>>
> > > > >>> Uwe, thats probably a good solution I think. just as long as we
> > > > >>> document somewhere,
> > > > >>> I think there is some warning verbiage in StandardTokenizer already
> > > > >>> about this.
> > > > >>>
> > > > >>> NOTE: if you change StandardTokenizerImpl.jflex and need to regenerate
> > > > >>>       the tokenizer, remember to use JRE 1.4 to run jflex (before
> > > > >>>       Lucene 3.0). This grammar now uses constructs (eg :digit:,
> > > > >>>       :letter:) whose meaning can vary according to the JRE used to
> > > > >>>       run jflex. See
> > > > >>>       https://issues.apache.org/jira/browse/LUCENE-1126 for details.
> > > > >>>
> > > > >>> On Mon, Nov 16, 2009 at 2:50 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > >>>
> > > > >>> But it is a general warning that should be placed in the Wiki: If you
> > > > >>> upgrade from Java 1.4 to Java 5, think about reindexing.
> > > > >>>
> > > > >>> It has definitely nothing to do with 3.0, because users could have
> > > > >>> changed (and most of them have) before.
> > > > >>>
> > > > >>> ------------------------------------------------------------------------
> > > > >>>
> > > > >>> *From:* Robert Muir [mailto:rcm...@gmail.com]
> > > > >>> *Sent:* Monday, November 16, 2009 8:45 PM
> > > > >>> *To:* java-dev@lucene.apache.org
> > > > >>> *Subject:* Re: Why release 3.0?
> > > > >>>
> > > > >>> right, my point is its true its nothing to do with Lucene at all,
> > > > >>> really.
> > > > >>> but the reality is we should clarify this to users I think.
> > > > >>>
> > > > >>> Its especially complex in the current StandardTokenizer, which uses a
> > > > >>> mix of hardcoded ranges and properties, can you tell me if you should
> > > > >>> reindex for given language X?
> > > > >>> I wouldn't want to answer that question right now.
> > > > >>>
> > > > >>> On Mon, Nov 16, 2009 at 2:42 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > >>>
> > > > >>> We tried out: Character.getType() for these two chars:
> > > > >>>
> > > > >>> Java 5:
> > > > >>> '\u00AD' = 16
> > > > >>> '\u06DD' = 16
> > > > >>>
> > > > >>> Java 1.4:
> > > > >>> '\u00AD' = 20
> > > > >>> '\u06DD' = 7
> > > > >>>
> > > > >>> The first is the soft hyphen.
> > > > >>>
> > > > >>> ------------------------------------------------------------------------
> > > > >>>
> > > > >>> *From:* Robert Muir [mailto:rcm...@gmail.com]
> > > > >>> *Sent:* Monday, November 16, 2009 8:37 PM
> > > > >>> *To:* java-dev@lucene.apache.org
> > > > >>> *Subject:* Re: Why release 3.0?
> > > > >>>
> > > > >>> right, its nothing to do with lucene, instead due to property changes,
> > > > >>> etc.
> > > > >>>
> > > > >>> i just think we should inform users on java 1.4/2.9 that if they
> > > > >>> upgrade to java 1.5/3.0, they should reindex.
> > > > >>>
> > > > >>> the reason i say this about properties, is there are some that change
> > > > >>> that will affect tokenizers, i give two examples, a hyphen that
> > > > >>> changes from punctuation to format (might affect
> > > > >>> SolrWordDelimiterFilter),
> > > > >>> and arabic ayah which changes from NSM to format, which surely affects
> > > > >>> ArabicLetterTokenizer.
> > > > >>>
> > > > >>> On Mon, Nov 16, 2009 at 2:33 PM, Steven A Rowe <sar...@syr.edu> wrote:
> > > > >>>
> > > > >>> Hi Robert,
> > > > >>>
> > > > >>> I agree that the Unicode version supported by the JVM, as you say,
> > > > >>> really has nothing to do with Lucene.
> > > > >>>
> > > > >>> The disruption here is users' upgrading from Java 1.4 to 1.5+, not
> > > > >>> when they upgrade Lucene. I'd guess with few exceptions that most
> > > > >>> people have been using Lucene with 1.5+ for a couple of years now,
> > > > >>> though.
> > > > >>>
> > > > >>> But even the upgrade from Java 1.4 to 1.5+ will have (had) zero impact
> > > > >>> on most Lucene users, assuming that most use Latin-1 exclusively;
> > > > >>> although I haven't looked, I'd be surprised if Latin-1 characters
> > > > >>> changed much, if at all, from Unicode 3.0 to 4.0.
> > > > >>>
> > > > >>> It would be useful, I think, to include (a pointer to?) a description
> > > > >>> of the details of the Unicode 3.0->4.0 differences in the Lucene 3.0
> > > > >>> release notes, since the minimum required Java version, and so also
> > > > >>> the supported Unicode version, changes then.
> > > > >>>
> > > > >>> Steve
> > > > >>>
> > > > >>> On 11/16/2009 at 2:15 PM, Robert Muir wrote:
> > > > >>>
> > > > >>>> the problem is that the properties have changed for various
> > > > >>>> characters, and new characters were added.
> > > > >>>>
> > > > >>>> it really has nothing to do with lucene, but the idea you can go from
> > > > >>>> jdk 1.4/lucene 2.9 to jdk 1.5/lucene3.0 without reindexing is not
> > > > >>>> true.
> > > > >>>>
> > > > >>>> On Mon, Nov 16, 2009 at 2:12 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > >>>>
> > > > >>>> But a UTF-8 stream from Java 4 can still be read with Java 5,
> > > > >>>> what is the problem? Java 5 extended Unicode support, but an index
> > > > >>>> created with older versions can still be read. UTF-8 is standardized.
> > > > >>>>
> > > > >>>> ________________________________
> > > > >>>>
> > > > >>>> From: Robert Muir [mailto:rcm...@gmail.com]
> > > > >>>> Sent: Monday, November 16, 2009 8:09 PM
> > > > >>>> To: java-dev@lucene.apache.org
> > > > >>>> Subject: Re: Why release 3.0?
> > > > >>>>
> > > > >>>> uwe, on topic please read my comment on LUCENE-1689, because
> > > > >>>> unicode version was bumped in jdk 1.5, i believe this index backwards
> > > > >>>> compatibility is only theoretical
> > > > >>>>
> > > > >>>> On Mon, Nov 16, 2009 at 2:05 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> > > > >>>>
> > > > >>>> 2.9 does *not* have the same format as 3.0; an index created with 3.0
> > > > >>>> cannot be read with 2.9. This is because compressed field support was
> > > > >>>> removed and therefore the version number of the stored fields file was
> > > > >>>> upgraded. But indexes from 2.9 can be read with 3.0 and support may get
> > > > >>>> removed in 4.0. 3.0 Indexes can be read until version 4.9.
> > > > >>>>
> > > > >>>> Uwe
> > > > >>>>
> > > > >>>> ________________________________
> > > > >>>>
> > > > >>>> From: Jake Mannix [mailto:jake.man...@gmail.com]
> > > > >>>> Sent: Monday, November 16, 2009 7:15 PM
> > > > >>>> To: java-dev@lucene.apache.org
> > > > >>>> Subject: Re: Why release 3.0?
> > > > >>>>
> > > > >>>> Don't users need to upgrade to 3.0 because 3.1 won't be
> > > > >>>> necessarily able to read your
> > > > >>>> 2.4 index file formats? I suppose if you've already upgraded to
> > > > >>>> 2.9, then all is well because
> > > > >>>> 2.9 is the same format as 3.0, but we can't assume all users
> > > > >>>> upgraded from 2.4 to 2.9.
> > > > >>>>
> > > > >>>> If you've done that already, then 3.0 might not be necessary,
> > > > >>>> but if you're on 2.4 right now,
> > > > >>>> you will be in for a bad surprise if you try to upgrade to 3.1.
> > > > >>>>
> > > > >>>> -jake
> > > > >>>>
> > > > >>>> On Mon, Nov 16, 2009 at 10:10 AM, Erick Erickson
> > > > >>>> <erickerick...@gmail.com> wrote:
> > > > >>>>
> > > > >>>> One of my "specialties" is asking obvious questions just to see
> > > > >>>> if everyone's assumptions are aligned. So with the discussion about
> > > > >>>> branching 3.0 I have to ask "Is there going to be any 3.0 release
> > > > >>>> intended for *production*?". And if not, would we save a lot of
> > > > >>>> work by just not worrying about retrofitting fixes to a 3.0 branch
> > > > >>>> and carrying on with 3.1 as the first *supported* 3.x release?
> > > > >>>>
> > > > >>>> Since 3.0 is "upgrade-to-java5 and remove deprecations", I'm not
> > > > >>>> sure *as a user* I see a good reason to upgrade to 3.0. Getting a
> > > > >>>> "beta/snapshot" release to get a head start on cleaning up my code
> > > > >>>> does seem worthwhile, if I have the spare time. And having a base
> > > > >>>> 3.0 version that's not changing all over the place would be useful
> > > > >>>> for that.
> > > > >>>>
> > > > >>>> That said, I'm also not terribly comfortable with a "release"
> > > > >>>> that's out there and unsupported.
> > > > >>>>
> > > > >>>> Apologies if this has already been discussed, but I don't
> > > > >>>> remember it. Although my memory isn't what it used to be (but
> > > > >>>> some would claim it never was<G>)...
> > > > >>>>
> > > > >>>> Erick
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
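
For anyone who wants to reproduce the Character.getType() values quoted in this thread on their own JVM, here is a minimal sketch. The class name UnicodeTypeCheck and its two helper methods are made up for illustration; they are not part of Lucene or of the LUCENE-2074 patch. On Java 5 or later both characters should report 16 (Character.FORMAT), matching the "Java 5" numbers in the thread. The Java 1.4 values (20 and 7) can only be reproduced on a 1.4 JRE, and this sketch would need String.format replaced to compile there.

```java
public class UnicodeTypeCheck {

    /** Returns the general-category code that the running JVM assigns to c. */
    public static int typeOf(char c) {
        return Character.getType(c);
    }

    /** Formats one report line in the same style as the values quoted in the thread. */
    public static String describe(char c) {
        return String.format("'\\u%04X' = %d", (int) c, typeOf(c));
    }

    public static void main(String[] args) {
        // The result depends on the Unicode tables shipped with the JVM,
        // so print the JVM version next to the category codes.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println(describe('\u00AD')); // soft hyphen
        System.out.println(describe('\u06DD')); // Arabic end of ayah
    }
}
```

Running the same class on two different JREs and comparing the output is a quick way to see whether a tokenizer that relies on character properties could behave differently for these characters.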