[jira] [Commented] (LUCENE-3607) Lucene Index files can not be reproduced faithfully (due to timestamps embedded)

Martin Oberhuber (Commented) (JIRA) Wed, 30 Nov 2011 03:22:09 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159983#comment-13159983
 ]


Martin Oberhuber commented on LUCENE-3607:
------------------------------------------

Hi all,

thanks for the many comments. I understand that there's no desire to change
behavior that's been working (and documented!) for years.

What about a different approach ... would it be possible to write a small Java
"main" that normalizes an index, very much like "stripping" an EXE? That way I
could postprocess my indexes (which are meant for distribution with our
product), while Lucene at its core could continue working as it does today.
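
As a rough illustration of how the effect of such a normalization step could
be checked, here is a minimal Java sketch that computes one SHA-256 digest
over an index directory, so that the output of two builds can be compared.
The class name IndexDigest is hypothetical and not part of Lucene; the sketch
assumes the index is a flat directory of regular files and it does not
normalize anything itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper, not a Lucene class: fingerprints an index directory.
public class IndexDigest {

    /** Hex SHA-256 over the names and contents of all regular files directly in dir. */
    public static String digest(Path dir) throws IOException, NoSuchAlgorithmException {
        List<Path> files = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.filter(Files::isRegularFile).forEach(files::add);
        }
        // Sort by file name so the result does not depend on listing order.
        files.sort((a, b) ->
                a.getFileName().toString().compareTo(b.getFileName().toString()));

        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (Path file : files) {
            // Feed the file name, a separator byte, then the raw file bytes.
            sha.update(file.getFileName().toString().getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0);
            sha.update(Files.readAllBytes(file));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java IndexDigest <indexDirA> <indexDirB>");
            System.exit(2);
        }
        String first = digest(Paths.get(args[0]));
        String second = digest(Paths.get(args[1]));
        System.out.println(first.equals(second) ? "identical" : "different");
    }
}
```

Run against the index directories of two builds from the same sources, this
reports "different" whenever any file differs in name or content, which is
what embedded timestamps would cause.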

Regarding some other comments,

- Our main reason for shipping a pre-built index is "initial search"
performance. In a large Eclipse-based product, generating the docs index on
initial search can take approx. 4 minutes on a decent computer. With everything
pre-indexed, initial search can proceed after 10 seconds. That's an important
usability issue for our help system. Another reason is the desire to find any
index building errors at build time (where we can investigate them) rather than
at runtime.

- We do have both the build environment and the deployment environment under 
full control (same Lucene version, same JVM version, same ICU version, all our
content is en_US).

- Regarding heuristics ... sure the search is heuristic at runtime, but that's
very different from having a heuristic build environment ... having
identical input produce identical output is still desirable.

- The issue of different analyzers used at index generation time vs. runtime has
indeed bitten us in the past (see 
[[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c16]]). In my personal 
opinion, the choice of analyzer should be bound to the content, and not to the 
search environment ... since in many cases the language of the search string 
will not be known, but the language of the documents / index is known. Right 
now, the best workaround for this at Eclipse is launching Eclipse with a "-nl 
en_US" argument to force US locale when I know all the docs are US... but that 
won't work at all in an environment where some docs are English and others are 
German, a very common scenario with software products on Eclipse (main product 
may be localized but some plugins are not).

Is that "analyzer binding to content vs. binding to search" issue known and
discussed in Lucene already? I.e. is it possible to have parts of the index
(the US one) searched with a US analyzer but other parts (the German one) with
a German analyzer? And why does the German analyzer truncate words at "."
while the US one does not (see
[[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c18]])?
                
> Lucene Index files can not be reproduced faithfully (due to timestamps 
> embedded)
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-3607
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3607
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/index
>    Affects Versions: 2.9.1
>         Environment: Eclipse 3.7
>            Reporter: Martin Oberhuber
>            Assignee: Michael McCandless
>
> Eclipse 3.7 uses Lucene 2.9.1 for indexing online help content. A 
> pre-generated help index can be shipped together with online content. As per
>    [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=364979 ]]
> it turns out that the help index can not be faithfully reproduced during a 
> build, because there are timestamps embedded in the index files, and the 
> "NameCounter" field in segments_2 contains different contents on every build.
> Not being able to faithfully reproduce the index from identical source bits 
> undermines trust in the index (and software delivery) being correct.
> I'm wondering whether this is a known issue and/or has been addressed in a 
> newer Lucene version already ?
