Re: Why release 3.0?

Robert Muir Mon, 16 Nov 2009 12:37:52 -0800

No, 1.6 is still on Unicode 4.0, but I hear 1.7 will be 5.1 or 5.2.

The only way to truly control this would be to use something like ICU to
control the Unicode version being used (it would actually be faster, too, and
support a higher version).
See http://site.icu-project.org/home/why-use-icu4j


The issue is that Lucene does not have third-party library dependencies. On
the other hand, I think Tika and/or Nutch already incorporate ICU for
charset detection.

I won't really argue for this, since I know nobody wants it, but you can see
how not being able to control Unicode semantics is really
difficult for a search engine.
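
For anyone who wants to see which semantics their own JRE gives them, here is
a minimal sketch. It is not from Lucene; the class name UnicodeTypeCheck and
its two helper methods are made up for illustration. It prints the general
category constant that Character.getType() reports for the soft hyphen
(U+00AD) and the Arabic end of ayah (U+06DD), the two characters Uwe tested
further down this thread. The result depends on the Unicode tables shipped
with whichever JRE runs it, and the int overload used here needs Java 5 or
later.

```java
public class UnicodeTypeCheck {

    // Returns the general category constant (e.g. Character.FORMAT)
    // that the running JRE assigns to the given code point.
    static int typeOf(int codePoint) {
        return Character.getType(codePoint);
    }

    // Formats a code point and its category constant on one line.
    static String describe(int codePoint) {
        return String.format("U+%04X = %d", codePoint, typeOf(codePoint));
    }

    public static void main(String[] args) {
        // The answer is a property of the JRE, not of Lucene.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println(describe(0x00AD)); // soft hyphen
        System.out.println(describe(0x06DD)); // Arabic end of ayah
    }
}
```

Comparing the printed numbers against the Java 1.4 values quoted below (20 and
7) shows whether a tokenizer driven by these properties would see the same
character classes as it did when the index was built.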

On Mon, Nov 16, 2009 at 3:33 PM, Uwe Schindler <[email protected]> wrote:

> Did 1.6 change the unicode version? Robert?
>
> -----
> UWE SCHINDLER
> Webserver/Middleware Development
> PANGAEA - Publishing Network for Geoscientific and Environmental Data
> MARUM - University of Bremen
> Room 2500, Leobener Str., D-28359 Bremen
> Tel.: +49 421 218 65595
> Fax:  +49 421 218 65505
> http://www.pangaea.de/
> E-mail: [email protected]
>
> > -----Original Message-----
> > From: Mark Miller [mailto:[email protected]]
> > Sent: Monday, November 16, 2009 9:30 PM
> > To: [email protected]
> > Subject: Re: Why release 3.0?
> >
> > And what happens when someone regenerates it with 1.6 without knowing?
> >
> > Uwe Schindler wrote:
> > > I check this by generating the file with 1.4 and 1.5. The 1.4 version will
> > > not change anymore, so we just leave the Java file, with no JFlex file
> > > anymore. The old one is used for Lucene until 2.9; if you use
> > > matchVersion=LUCENE_30, the new one is used, which can also be regenerated.
> > >
> > > -----
> > > Uwe Schindler
> > > H.-H.-Meier-Allee 63, D-28213 Bremen
> > > http://www.thetaphi.de
> > > eMail: [email protected]
> > >
> > >
> > >> -----Original Message-----
> > >> From: Mark Miller [mailto:[email protected]]
> > >> Sent: Monday, November 16, 2009 9:21 PM
> > >> To: [email protected]
> > >> Subject: Re: Why release 3.0?
> > >>
> > >> Good point - and that likely means the current warning is not working -
> > >> what can we do to improve it?
> > >>
> > >> Perhaps a new text file called jflexregen or something, and it
> > >> specifically says you must use java 1.5?
> > >>
> > >> Uwe Schindler wrote:
> > >>
> > >>> I think the regenerated code in Standard has not been generated
> > >>> with 1.4 for years :-) Most developers use 1.5 or even 1.6, so it has
> > >>> already changed incompatibly.
> > >>>
> > >>>
> > >>>
> > >>> -----
> > >>> Uwe Schindler
> > >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> > >>> http://www.thetaphi.de
> > >>> eMail: [email protected]
> > >>>
> > >>>
> > >>> ----------------------------------------------------------------------
> > >>>
> > >>> *From:* Robert Muir [mailto:[email protected]]
> > >>> *Sent:* Monday, November 16, 2009 8:52 PM
> > >>> *To:* [email protected]
> > >>> *Subject:* Re: Why release 3.0?
> > >>>
> > >>>
> > >>>
> > >>> Uwe, that's probably a good solution, I think, just as long as we
> > >>> document it somewhere.
> > >>> I think there is some warning verbiage in StandardTokenizer already
> > >>> about this.
> > >>>
> > >>> NOTE: if you change StandardTokenizerImpl.jflex and need to regenerate
> > >>>       the tokenizer, remember to use JRE 1.4 to run jflex (before
> > >>>       Lucene 3.0).  This grammar now uses constructs (eg :digit:,
> > >>>       :letter:) whose meaning can vary according to the JRE used to
> > >>>       run jflex.  See
> > >>>       https://issues.apache.org/jira/browse/LUCENE-1126 for details.
> > >>>
> > >>> On Mon, Nov 16, 2009 at 2:50 PM, Uwe Schindler <[email protected]
> > >>> <mailto:[email protected]>> wrote:
> > >>>
> > >>> But it is a general warning that should be placed in the Wiki: If you
> > >>> upgrade from Java 1.4 to Java 5, think about reindexing.
> > >>>
> > >>>
> > >>>
> > >>> It definitely has nothing to do with 3.0, because users could have
> > >>> switched Java versions (and most of them have) before.
> > >>>
> > >>> -----
> > >>> Uwe Schindler
> > >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> > >>> http://www.thetaphi.de
> > >>> eMail: [email protected] <mailto:[email protected]>
> > >>>
> > >>>
> > >>> ----------------------------------------------------------------------
> > >>>
> > >>> *From:* Robert Muir [mailto:[email protected]
> > <mailto:[email protected]>]
> > >>> *Sent:* Monday, November 16, 2009 8:45 PM
> > >>>
> > >>>
> > >>> *To:* [email protected] <mailto:[email protected]>
> > >>> *Subject:* Re: Why release 3.0?
> > >>>
> > >>>
> > >>>
> > >>> Right, my point is that it's true it has nothing to do with Lucene at
> > >>> all, really, but the reality is we should clarify this to users, I think.
> > >>>
> > >>> It's especially complex in the current StandardTokenizer, which uses a
> > >>> mix of hardcoded ranges and properties. Can you tell me if you should
> > >>> reindex for a given language X?
> > >>> I wouldn't want to answer that question right now.
> > >>>
> > >>> On Mon, Nov 16, 2009 at 2:42 PM, Uwe Schindler <[email protected]
> > >>> <mailto:[email protected]>> wrote:
> > >>>
> > >>> We tried out: Character.getType() for these two chars:
> > >>>
> > >>>
> > >>>
> > >>> Java 5:
> > >>> '\u00AD' = 16
> > >>> '\u06DD' = 16
> > >>>
> > >>> Java 1.4:
> > >>> '\u00AD' = 20
> > >>> '\u06DD' = 7
> > >>>
> > >>>
> > >>>
> > >>> The first is the soft hyphen.
> > >>>
> > >>> -----
> > >>> Uwe Schindler
> > >>> H.-H.-Meier-Allee 63, D-28213 Bremen
> > >>> http://www.thetaphi.de
> > >>> eMail: [email protected] <mailto:[email protected]>
> > >>>
> > >>>
> > >>> ----------------------------------------------------------------------
> > >>>
> > >>> *From:* Robert Muir [mailto:[email protected]
> > <mailto:[email protected]>]
> > >>> *Sent:* Monday, November 16, 2009 8:37 PM
> > >>>
> > >>>
> > >>> *To:* [email protected] <mailto:[email protected]>
> > >>> *Subject:* Re: Why release 3.0?
> > >>>
> > >>>
> > >>>
> > >>> Right, it has nothing to do with Lucene; it is instead due to property
> > >>> changes, etc.
> > >>>
> > >>> i just think we should inform users on java 1.4/2.9 that if they
> > >>> upgrade to java 1.5/3.0, they should reindex.
> > >>>
> > >>> The reason I say this about properties is that some of the changes
> > >>> will affect tokenizers. I give two examples: a hyphen that
> > >>> changes from punctuation to format (might affect SolrWordDelimiterFilter),
> > >>> and Arabic ayah, which changes from NSM to format, which surely affects
> > >>> ArabicLetterTokenizer.
> > >>>
> > >>> On Mon, Nov 16, 2009 at 2:33 PM, Steven A Rowe <[email protected]
> > >>> <mailto:[email protected]>> wrote:
> > >>>
> > >>> Hi Robert,
> > >>>
> > >>> I agree that the Unicode version supported by the JVM, as you say,
> > >>> really has nothing to do with Lucene.
> > >>>
> > >>> The disruption here is users' upgrading from Java 1.4 to 1.5+, not
> > >>> when they upgrade Lucene.  I'd guess with few exceptions that most
> > >>> people have been using Lucene with 1.5+ for a couple of years now, though.
> > >>
> > >>> But even the upgrade from Java 1.4 to 1.5+ will have (had) zero impact
> > >>> on most Lucene users, assuming that most use Latin-1 exclusively;
> > >>> although I haven't looked, I'd be surprised if Latin-1 characters
> > >>> changed much, if at all, from Unicode 3.0 to 4.0.
> > >>>
> > >>> It would be useful, I think, to include (a pointer to?) a description
> > >>> of the details of the Unicode 3.0->4.0 differences in the Lucene 3.0
> > >>> release notes, since the minimum required Java version, and so also
> > >>> the supported Unicode version, changes then.
> > >>>
> > >>> Steve
> > >>>
> > >>>
> > >>> On 11/16/2009 at 2:15 PM, Robert Muir wrote:
> > >>>
> > >>>> The problem is that the properties have changed for various characters,
> > >>>> and new characters were added.
> > >>>>
> > >>>> It really has nothing to do with Lucene, but the idea that you can go
> > >>>> from JDK 1.4/Lucene 2.9 to JDK 1.5/Lucene 3.0 without reindexing is not
> > >>>> true.
> > >>
> > >>>> On Mon, Nov 16, 2009 at 2:12 PM, Uwe Schindler <[email protected]
> > >>>>
> > >>> <mailto:[email protected]>> wrote:
> > >>>
> > >>>>       But a UTF-8 stream from Java 1.4 can still be read with Java 5,
> > >>>> what is the problem? Java 5 extended Unicode support, but an index
> > >>>> created with older versions can still be read. UTF-8 is standardized.
> > >>>>
> > >>>>
> > >>>>
> > >>>>       -----
> > >>>>       Uwe Schindler
> > >>>>       H.-H.-Meier-Allee 63, D-28213 Bremen
> > >>>>       http://www.thetaphi.de
> > >>>>       eMail: [email protected] <mailto:[email protected]>
> > >>>>
> > >>>>
> > >>>> ________________________________
> > >>>>
> > >>>>
> > >>>>       From: Robert Muir [mailto:[email protected]
> > >>>>
> > >>> <mailto:[email protected]>]
> > >>>
> > >>>>       Sent: Monday, November 16, 2009 8:09 PM
> > >>>>
> > >>>>       To: [email protected] <mailto:java-
> > >>>>
> > >> [email protected]>
> > >>
> > >>>>       Subject: Re: Why release 3.0?
> > >>>>
> > >>>>
> > >>>>
> > >>>>       Uwe, on topic, please read my comment on LUCENE-1689. Because the
> > >>>> Unicode version was bumped in JDK 1.5, I believe this index backwards
> > >>>> compatibility is only theoretical.
> > >>>>
> > >>>>       On Mon, Nov 16, 2009 at 2:05 PM, Uwe Schindler <
> [email protected]
> > >>>>
> > >>> <mailto:[email protected]>> wrote:
> > >>>
> > >>>>       2.9 does *not* have the same format as 3.0; an index created with
> > >>>> 3.0 cannot be read with 2.9. This is because compressed field support was
> > >>>> removed and therefore the version number of the stored fields file was
> > >>>> upgraded. But indexes from 2.9 can be read with 3.0, and support may get
> > >>>> removed in 4.0. 3.0 indexes can be read until version 4.9.
> > >>>>
> > >>>>
> > >>>>
> > >>>>       Uwe
> > >>>>
> > >>>>       -----
> > >>>>       Uwe Schindler
> > >>>>       H.-H.-Meier-Allee 63, D-28213 Bremen
> > >>>>       http://www.thetaphi.de
> > >>>>       eMail: [email protected] <mailto:[email protected]>
> > >>>>
> > >>>>
> > >>>> ________________________________
> > >>>>
> > >>>>
> > >>>>       From: Jake Mannix [mailto:[email protected]
> > >>>>
> > >>> <mailto:[email protected]>]
> > >>>
> > >>>>       Sent: Monday, November 16, 2009 7:15 PM
> > >>>>
> > >>>>
> > >>>>       To: [email protected] <mailto:java-
> > >>>>
> > >> [email protected]>
> > >>
> > >>>>       Subject: Re: Why release 3.0?
> > >>>>
> > >>>>
> > >>>>
> > >>>>       Don't users need to upgrade to 3.0 because 3.1 won't necessarily
> > >>>> be able to read your 2.4 index file formats?  I suppose if you've
> > >>>> already upgraded to 2.9, then all is well because 2.9 is the same
> > >>>> format as 3.0, but we can't assume all users upgraded from 2.4 to 2.9.
> > >>>>
> > >>>>       If you've done that already, then 3.0 might not be necessary,
> > >>>> but if you're on 2.4 right now, you will be in for a bad surprise if
> > >>>> you try to upgrade to 3.1.
> > >>>>
> > >>>>         -jake
> > >>>>
> > >>>>       On Mon, Nov 16, 2009 at 10:10 AM, Erick Erickson
> > >>>> <[email protected] <mailto:[email protected]>> wrote:
> > >>>>
> > >>>>       One of my "specialties" is asking obvious questions just to see
> > >>>> if everyone's assumptions are aligned. So with the discussion about
> > >>>> branching 3.0 I have to ask "Is there going to be any 3.0 release
> > >>>> intended for *production*?". And if not, would we save a lot of
> > >>>> work by just not worrying about retrofitting fixes to a 3.0 branch
> > >>>> and carrying on with 3.1 as the first *supported* 3.x release?
> > >>>>
> > >>>>       Since 3.0 is "upgrade-to-java5 and remove deprecations", I'm not
> > >>>> sure *as a user* I see a good reason to upgrade to 3.0. Getting a
> > >>>> "beta/snapshot" release to get a head start on cleaning up my code
> > >>>> does seem worthwhile, if I have the spare time. And having a base
> > >>>> 3.0 version that's not changing all over the place would be useful
> > >>>> for that.
> > >>>>
> > >>>>       That said, I'm also not terribly comfortable with a "release"
> > >>>> that's out there and unsupported.
> > >>>>
> > >>>>       Apologies if this has already been discussed, but I don't
> > >>>> remember it. Although my memory isn't what it used to be (but
> > >>>> some would claim it never was<G>)...
> > >>>>
> > >>>>       Erick
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Robert Muir
> > >>> [email protected] <mailto:[email protected]>
> > >>>
> > >>>
> > >> --
> > >> - Mark
> > >>
> > >> http://www.lucidimagination.com
> > >>
> > >>
> > >>
> > >>
> > >> ---------------------------------------------------------------------
> > >> To unsubscribe, e-mail: [email protected]
> > >> For additional commands, e-mail: [email protected]
> > >>
> >
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
>


-- 
Robert Muir
[email protected]
