[
https://issues.apache.org/jira/browse/LUCENE-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159983#comment-13159983
]
Martin Oberhuber commented on LUCENE-3607:
------------------------------------------
Hi all,
thanks for the many comments. I understand that there's no desire to change
behavior that has been working (and documented!) for years.
What about a different approach: would it be possible to write a small Java
"main" that normalizes an index, very much like "stripping" an EXE? That way I
could postprocess my indexes (which are meant for distribution with our
product), but at its core Lucene could continue working as today.
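To judge whether such a post-processing step works, a build would need some way
to tell whether two generated index directories are byte-identical. The
following is only a minimal sketch of that check using plain java.nio and
java.security; the class name IndexDigest and its method are hypothetical and
not part of Lucene, and the sketch does not attempt the normalization itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not a Lucene class: fingerprints a directory of index files.
final class IndexDigest {

    /** Returns a hex SHA-256 over the relative names and bytes of all regular files below dir. */
    static String digest(Path dir) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<String> names;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Sort relative names so the result does not depend on listing order.
            names = walk.filter(Files::isRegularFile)
                        .map(p -> dir.relativize(p).toString().replace('\\', '/'))
                        .sorted()
                        .collect(Collectors.toList());
        }
        for (String name : names) {
            md.update(name.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between file name and file content
            md.update(Files.readAllBytes(dir.resolve(name)));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: IndexDigest <indexDirA> <indexDirB>");
            System.exit(2);
        }
        String a = digest(Paths.get(args[0]));
        String b = digest(Paths.get(args[1]));
        System.out.println(a.equals(b) ? "identical" : "different");
    }
}
```

Run against the outputs of two builds from the same sources, this would report
"different" whenever any file's name or content differs, for example because
of an embedded timestamp.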
Regarding some other comments,
- Our main reason for shipping a pre-built index is "initial search"
performance. In a large Eclipse-based product, generating the docs index on
initial search can take approximately 4 minutes on a decent computer. With
everything pre-indexed, initial search can proceed after 10 seconds. That's an
important usability issue for our help system. Another reason is the desire to
find any index building errors at build time (where we can investigate them)
rather than at runtime.
- We do have both the build environment and the deployment environment under
full control (same Lucene version, same JVM version, same ICU version, all our
content is en_US).
- Regarding heuristics ... sure, the search is heuristic at runtime, but that's
a very different thing from a heuristic build environment: having identical
input produce identical output is still desirable.
- The issue of different analyzers used at index generation time vs. runtime has
indeed bitten us in the past (see
[[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c16]]). In my personal
opinion, the choice of analyzer should be bound to the content, and not to the
search environment ... since in many cases the language of the search string
will not be known, but the language of the documents / index is known. Right
now, the best workaround for this at Eclipse is launching Eclipse with a "-nl
en_US" argument to force US locale when I know all the docs are US... but that
won't work at all in an environment where some docs are English and others are
German, a very common scenario with software products on Eclipse (main product
may be localized but some plugins are not).
Is that "analyzer binding to content vs. binding to search" issue known and
discussed at Lucene already? I.e., is it possible to have parts of the index
(the US one) searched with a US analyzer but other parts (the German one) with
a German analyzer? And why does the German analyzer truncate words at "."
while the US one does not (see
[[https://bugs.eclipse.org/bugs/show_bug.cgi?id=219928#c18]])?
> Lucene Index files can not be reproduced faithfully (due to timestamps
> embedded)
> --------------------------------------------------------------------------------
>
> Key: LUCENE-3607
> URL: https://issues.apache.org/jira/browse/LUCENE-3607
> Project: Lucene - Java
> Issue Type: Bug
> Components: core/index
> Affects Versions: 2.9.1
> Environment: Eclipse 3.7
> Reporter: Martin Oberhuber
> Assignee: Michael McCandless
>
> Eclipse 3.7 uses Lucene 2.9.1 for indexing online help content. A
> pre-generated help index can be shipped together with online content. As per
> [[https://bugs.eclipse.org/bugs/show_bug.cgi?id=364979 ]]
> it turns out that the help index can not be faithfully reproduced during a
> build, because there are timestamps embedded in the index files, and the
> "NameCounter" field in segments_2 contains different contents on every build.
> Not being able to faithfully reproduce the index from identical source bits
> undermines trust in the index (and software delivery) being correct.
> I'm wondering whether this is a known issue and/or has been addressed in a
> newer Lucene version already ?