[jira] [Commented] (LUCENE-3607) Lucene Index files can not be reproduced faithfully (due to timestamps embedded)

Uwe Schindler (Commented) (JIRA) Mon, 28 Nov 2011 14:32:04 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158866#comment-13158866
 ]


Uwe Schindler commented on LUCENE-3607:
---------------------------------------

Martin,
having read your Eclipse issue about the packaged Lucene indexes, I think I
know why you want to package pre-built indexes with Eclipse (Robert and I are
the people behind the well-known Porter-Stemmer Java 7 bug...): you do not
want your IDE to crash if somebody uses Java 1.7.0 GA.

There are other problems with shipping pre-built indexes, which I found while
preparing a talk about the Java 7 bug, during which I was not able to make
Eclipse crash on my German-localized Windows machine running Java 7. The cause
is another bug in Eclipse itself, but it would heavily affect your prebuilt
indexes. Other help systems supporting full-text search generally do not ship
with prebuilt indexes (e.g. Windows CHM help builds the indexes on first use,
like Eclipse did in the past).

If you really want to package pre-built indexes with the release you have to 
make sure:

- The target platform of your index is exactly the same as the platform where
you indexed, unless you use only analyzers that do not depend on the
underlying JVM (e.g. ICU analyzers with a fixed ICU4J version shipped together
with the product). Otherwise an index built with JDK 6 (which uses Unicode
4.0) may not work correctly on a computer with JDK 7, as the character
properties vary with the target platform (Java 7 uses Unicode 6.0). This
especially affects Asian languages.
- You also have to make sure that the target platform uses the same analyzer
set as the indexing platform. The problem I had trying to reproduce the
famous crash on my German-localized O/S was caused by an ugly bug in your help
system (you should open an issue at Eclipse as this is a real bug; I don't use
Eclipse, so I just wanted to have something crashing because of this bug in my
talk at GotoCon Aarhus): the help system indexes the text using an analyzer
that is created based on the locale of the operating system (in my case de_DE;
as the analyzer you have chosen for de_DE does not use the Porter stemmer, my
Eclipse installation did *not* crash because of the Java 7 bug). In fact the
help text itself was English. So the Eclipse runtime indexed English help text
with a German analyzer, which is a bug. When I later changed the locale to
English, my searches returned no results anymore, because searching suddenly
used a different analyzer than the one used while indexing (after wiping the
index, Eclipse of course crashed because of the Java 7 bug). The correct
behaviour would be that the help file itself ships with its own language as
metadata, which is then used for indexing! So an English help file should have
a property saying "I am English". This must be fixed before releasing indexes
with Eclipse installers: the analyzer used for query analysis must be
identical to the one (including Unicode version) used during indexing. If it
depends on the locale, it is incorrect and will not fit the index on disk
(shipped with the installation files).
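
The metadata-driven selection described above could be sketched roughly as
follows. This is a minimal illustration, not Eclipse or Lucene code: the class
`HelpAnalyzerSelector`, the method `analyzerKeyFor`, and the analyzer key
strings are all hypothetical names. The only point is that the choice is
derived from the language tag stored with the help document and never from
`Locale.getDefault()`, so indexing and searching agree regardless of the O/S
locale.

```java
import java.util.Locale;

/**
 * Hypothetical helper: picks an analyzer key from the language declared by
 * the help document itself, never from the operating system locale.
 */
final class HelpAnalyzerSelector {
    private HelpAnalyzerSelector() {}

    /**
     * Returns an analyzer key for a document language tag such as "en",
     * "de-DE" or "de_DE". The returned key names are illustrative assumptions.
     */
    static String analyzerKeyFor(String documentLanguageTag) {
        // Accept both "de_DE" and "de-DE" spellings of the tag.
        String normalized = documentLanguageTag.replace('_', '-');
        String language = Locale.forLanguageTag(normalized).getLanguage();
        switch (language) {
            case "en":
                return "english-analyzer";
            case "de":
                return "german-analyzer";
            default:
                // Fixed fallback that does not depend on the machine's locale.
                return "standard-analyzer";
        }
    }
}
```

The same call would be used at index time and at query time, for example
`HelpAnalyzerSelector.analyzerKeyFor("en")` for an English help file, so a
German-localized machine and an English-localized one end up with the same
analyzer.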

And as noted a few comments before: during indexing you have to use a
predictable, non-parallelized MergeScheduler/MergePolicy (I recommend
SerialMergeScheduler and LogDocMergePolicy); otherwise segment merging is the
second unpredictable factor. Also maybe forceMerge the index to 1 segment (but
don't use TieredMergePolicy, as it may reorder segments during merging).

And I agree with Yonik: we should keep the time-dependent version numbers and
index metadata in the index. We also need information like the operating
system used to build segments in our index files as metadata. If you want
indexing to be 100% predictable, write a custom codec in 4.0 that does not
write "random" numbers and environment metadata in headers, and use
SerialMergeScheduler with LogDocMergePolicy.
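
One way to check whether a build is actually reproducible is to hash the index
directories of two builds and compare the results. The following is a minimal
sketch using only the JDK, not part of Lucene: the class `IndexDirectoryDigest`
and its method `digestOf` are hypothetical names, and it assumes a flat index
directory (no subdirectories) whose files fit in memory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for comparing two builds of the same index. */
final class IndexDirectoryDigest {
    private IndexDirectoryDigest() {}

    /**
     * Returns a hex SHA-256 over the names and contents of all regular files
     * directly inside dir, visited in sorted name order.
     */
    static String digestOf(Path dir) throws IOException, NoSuchAlgorithmException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }
        // Directory listing order is not guaranteed, so sort before hashing.
        Collections.sort(files);

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (Path f : files) {
            sha.update(f.getFileName().toString().getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator between file name and content
            sha.update(Files.readAllBytes(f));
        }

        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

If two builds from identical sources give different digests, some
unpredictable factor such as an embedded timestamp or the merge order is still
present.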
                
> Lucene Index files can not be reproduced faithfully (due to timestamps 
> embedded)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3607
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 2.9.1
>         Environment: Eclipse 3.7
>            Reporter: Martin Oberhuber
>            Assignee: Michael McCandless
>
> Eclipse 3.7 uses Lucene 2.9.1 for indexing online help content. A 
> pre-generated help index can be shipped together with online content. As per
>    https://bugs.eclipse.org/bugs/show_bug.cgi?id=364979
> it turns out that the help index can not be faithfully reproduced during a 
> build, because there are timestamps embedded in the index files, and the 
> "NameCounter" field in segments_2 contains different contents on every build.
> Not being able to faithfully reproduce the index from identical source bits 
> undermines trust in the index (and software delivery) being correct.
> I'm wondering whether this is a known issue and/or has been addressed in a 
> newer Lucene version already?



