[
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774768#action_12774768
]
Michael McCandless commented on LUCENE-1458:
--------------------------------------------
OK new numbers after the above commits:
JAVA:
java version "1.5.0_19"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02)
Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)
OS:
SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris
||Query||Deletes %||Tot hits||QPS old||QPS new||Pct change||
|body:[tec TO tet]|0.0|1934684|3.13|3.96|{color:green}26.5%{color}|
|body:[tec TO tet]|0.1|1932754|2.98|3.62|{color:green}21.5%{color}|
|body:[tec TO tet]|1.0|1915224|2.97|3.62|{color:green}21.9%{color}|
|body:[tec TO tet]|10|1741255|2.96|3.61|{color:green}22.0%{color}|
|real*|0.0|389378|27.80|28.73|{color:green}3.3%{color}|
|real*|0.1|389005|26.74|28.93|{color:green}8.2%{color}|
|real*|1.0|385434|26.61|29.04|{color:green}9.1%{color}|
|real*|10|350404|26.32|29.29|{color:green}11.3%{color}|
|1|0.0|1170209|21.81|22.27|{color:green}2.1%{color}|
|1|0.1|1169068|20.41|21.47|{color:green}5.2%{color}|
|1|1.0|1158528|20.42|21.41|{color:green}4.8%{color}|
|1|10|1053269|20.52|21.39|{color:green}4.2%{color}|
|2|0.0|1088727|23.29|23.86|{color:green}2.4%{color}|
|2|0.1|1087700|21.67|22.92|{color:green}5.8%{color}|
|2|1.0|1077788|21.77|22.80|{color:green}4.7%{color}|
|2|10|980068|21.90|23.04|{color:green}5.2%{color}|
|+1 +2|0.0|700793|7.25|6.65|{color:red}-8.3%{color}|
|+1 +2|0.1|700137|6.58|6.33|{color:red}-3.8%{color}|
|+1 +2|1.0|693756|6.50|6.32|{color:red}-2.8%{color}|
|+1 +2|10|630953|6.73|6.37|{color:red}-5.3%{color}|
|+1 -2|0.0|469416|8.11|7.27|{color:red}-10.4%{color}|
|+1 -2|0.1|468931|7.02|6.61|{color:red}-5.8%{color}|
|+1 -2|1.0|464772|7.27|6.75|{color:red}-7.2%{color}|
|+1 -2|10|422316|7.28|6.99|{color:red}-4.0%{color}|
|1 2 3 -4|0.0|1104704|4.80|4.46|{color:red}-7.1%{color}|
|1 2 3 -4|0.1|1103583|4.74|4.40|{color:red}-7.2%{color}|
|1 2 3 -4|1.0|1093634|4.72|4.45|{color:red}-5.7%{color}|
|1 2 3 -4|10|994046|4.79|4.63|{color:red}-3.3%{color}|
|"world economy"|0.0|985|19.43|16.79|{color:red}-13.6%{color}|
|"world economy"|0.1|984|18.71|16.59|{color:red}-11.3%{color}|
|"world economy"|1.0|970|19.65|16.86|{color:red}-14.2%{color}|
|"world economy"|10|884|19.69|17.25|{color:red}-12.4%{color}|
The term range query and prefix query are now faster (about 22-27% and
3-11% respectively), and the single-term queries gain 2-6%; the boolean
queries are 3-10% slower; the phrase query shows the biggest slowdown
(11-14%)...
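For reference, the "Pct change" column appears to be the relative change of "QPS new" against "QPS old". A minimal sketch of that arithmetic follows; the class and method names are made up for illustration and are not part of the benchmark code:

```java
import java.util.Locale;

class PctChange {
    // Relative change of the new QPS against the old QPS, in percent.
    // Positive means the new code is faster, negative means slower.
    static double pctChange(double qpsOld, double qpsNew) {
        return 100.0 * (qpsNew - qpsOld) / qpsOld;
    }

    public static void main(String[] args) {
        // QPS values copied from two 0.0%-deletes rows of the table above.
        double range = pctChange(3.13, 3.96);     // body:[tec TO tet]
        double phrase = pctChange(19.43, 16.79);  // "world economy"
        System.out.println(String.format(Locale.ROOT, "range:  %.1f%%", range));
        System.out.println(String.format(Locale.ROOT, "phrase: %.1f%%", phrase));
    }
}
```

Small differences in the last digit against the table are possible, since the table's QPS values are themselves rounded to two decimals.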
> Further steps towards flexible indexing
> ---------------------------------------
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2,
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new more efficient terms dict format. This still
> uses tii/tis files, but the tii only stores term & long offset
> (not a TermInfo). At seek points, tis encodes term & freq/prox
> offsets absolutely instead of with deltas. Also, tis/tii
> are structured by field, so we don't have to record field number
> in every term.
> .
> On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
> RAM usage when loading terms dict index is significantly less
> since we only load an array of offsets and an array of String (no
> more TermInfo array). It should be faster to init too.
> .
> This part is basically done.
> * Introduces modular reader codec that strongly decouples terms dict
> from docs/positions readers. EG there is no more TermInfo used
> when reading the new format.
> .
> There's nice symmetry now between reading & writing in the codec
> chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
> This part is basically done.
> * Introduces a new "flex" API for iterating through the fields,
> terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
> This replaces TermEnum/Docs/Positions. SegmentReader emulates the
> old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> fix any hidden assumptions.
> * Expose new API out of IndexReader, deprecate old API but emulate
> old API on top of new one, switch all core/contrib users to the
> new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum,
> DocsEnum, PostingsEnum -- this would give readers API flexibility
> (not just index-file-format flexibility). EG if someone wanted
> to store payload at the term-doc level instead of
> term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
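The quoted description says that, at seek points, the tis file stores freq/prox offsets absolutely instead of as deltas. The idea can be sketched roughly as below. This is not the patch's code: the class name, the fixed `INTERVAL`, and the plain `long[]` layout are all assumptions made for illustration, whereas the real format writes variable-length values to a file.

```java
class SeekPointOffsets {
    // Hypothetical interval: every INTERVAL-th entry is a seek point.
    static final int INTERVAL = 4;

    // Encode ascending file offsets: absolute value at seek points,
    // delta from the previous entry everywhere else.
    static long[] encode(long[] offsets) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % INTERVAL == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    // Decode entry `target` starting from the nearest preceding seek point.
    // Entries before that seek point are never read.
    static long decodeAt(long[] encoded, int target) {
        int seek = (target / INTERVAL) * INTERVAL;
        long value = encoded[seek];
        for (int i = seek + 1; i <= target; i++) {
            value += encoded[i];
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {10, 25, 40, 90, 120, 121, 300};
        long[] encoded = encode(offsets);
        for (int i = 0; i < offsets.length; i++) {
            if (decodeAt(encoded, i) != offsets[i]) {
                throw new AssertionError("mismatch at entry " + i);
            }
        }
        System.out.println("round trip ok");
    }
}
```

Because each seek point carries an absolute value, a reader that jumps there needs no state from earlier entries, which is presumably why the index can store just a term and a long offset.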