[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775017#action_12775017 ]
Michael McCandless commented on LUCENE-1458:
--------------------------------------------

I removed all the "if (Codec.DEBUG)" lines in a local checkout and re-ran sortBench.py -- looks like flex is pretty close to trunk now (on OpenSolaris, Java 1.5, at least):

JAVA:
java version "1.5.0_19"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02)
Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)

OS:
SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris

Index /x/lucene/wiki.baseline.nd5M already exists...
Index /x/lucene/wiki.flex.nd5M already exists...

||Query||Deletes %||Tot hits||QPS old||QPS new||Pct change||
|body:[tec TO tet]|0.0|1934684|2.95|4.04|{color:green}36.9%{color}|
|body:[tec TO tet]|0.1|1932754|2.86|3.73|{color:green}30.4%{color}|
|body:[tec TO tet]|1.0|1915224|2.88|3.69|{color:green}28.1%{color}|
|body:[tec TO tet]|10|1741255|2.86|3.74|{color:green}30.8%{color}|
|real*|0.0|389378|26.85|28.74|{color:green}7.0%{color}|
|real*|0.1|389005|25.83|26.96|{color:green}4.4%{color}|
|real*|1.0|385434|25.55|27.15|{color:green}6.3%{color}|
|real*|10|350404|25.38|28.10|{color:green}10.7%{color}|
|1|0.0|1170209|21.75|21.80|{color:green}0.2%{color}|
|1|0.1|1169068|20.39|22.02|{color:green}8.0%{color}|
|1|1.0|1158528|20.35|21.88|{color:green}7.5%{color}|
|1|10|1053269|20.48|21.96|{color:green}7.2%{color}|
|2|0.0|1088727|23.37|23.42|{color:green}0.2%{color}|
|2|0.1|1087700|21.61|23.49|{color:green}8.7%{color}|
|2|1.0|1077788|21.85|23.46|{color:green}7.4%{color}|
|2|10|980068|21.93|23.66|{color:green}7.9%{color}|
|+1 +2|0.0|700793|7.29|7.32|{color:green}0.4%{color}|
|+1 +2|0.1|700137|6.58|6.70|{color:green}1.8%{color}|
|+1 +2|1.0|693756|6.60|6.68|{color:green}1.2%{color}|
|+1 +2|10|630953|6.73|6.92|{color:green}2.8%{color}|
|+1 -2|0.0|469416|8.07|7.69|{color:red}-4.7%{color}|
|+1 -2|0.1|468931|7.02|7.46|{color:green}6.3%{color}|
|+1 -2|1.0|464772|7.31|7.12|{color:red}-2.6%{color}|
|+1 -2|10|422316|7.28|7.60|{color:green}4.4%{color}|
|1 2 3 -4|0.0|1104704|4.83|4.52|{color:red}-6.4%{color}|
|1 2 3 -4|0.1|1103583|4.73|4.48|{color:red}-5.3%{color}|
|1 2 3 -4|1.0|1093634|4.75|4.46|{color:red}-6.1%{color}|
|1 2 3 -4|10|994046|4.87|4.65|{color:red}-4.5%{color}|
|"world economy"|0.0|985|19.50|20.11|{color:green}3.1%{color}|
|"world economy"|0.1|984|18.65|19.76|{color:green}6.0%{color}|
|"world economy"|1.0|970|19.56|18.71|{color:red}-4.3%{color}|
|"world economy"|10|884|19.58|20.19|{color:green}3.1%{color}|

> Further steps towards flexible indexing
> ---------------------------------------
>
> Key: LUCENE-1458
> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Index
> Affects Versions: 2.9
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2,
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
> * Switches to a new more efficient terms dict format. This still
>   uses tii/tis files, but the tii only stores term & long offset
>   (not a TermInfo). At seek points, tis encodes term & freq/prox
>   offsets absolutely instead of with deltas. Also, tis/tii
>   are structured by field, so we don't have to record field number
>   in every term.
>   .
>   On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>   -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>   .
>   RAM usage when loading terms dict index is significantly less
>   since we only load an array of offsets and an array of String (no
>   more TermInfo array). It should be faster to init too.
>   .
>   This part is basically done.
> * Introduces modular reader codec that strongly decouples terms dict
>   from docs/positions readers. EG there is no more TermInfo used
>   when reading the new format.
>   .
>   There's nice symmetry now between reading & writing in the codec
>   chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>   This part is basically done.
> * Introduces a new "flex" API for iterating through the fields,
>   terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>   This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>   old API on top of the new API to keep back-compat.
>
> Next steps:
> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>   fix any hidden assumptions.
> * Expose new API out of IndexReader, deprecate old API but emulate
>   old API on top of new one, switch all core/contrib users to the
>   new API.
> * Maybe switch to AttributeSources as the base class for TermsEnum,
>   DocsEnum, PostingsEnum -- this would give readers API flexibility
>   (not just index-file-format flexibility). EG if someone wanted
>   to store payload at the term-doc level instead of
>   term-doc-position level, you could just add a new attribute.
> * Test performance & iterate.
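
For readers following the terms dict change described in the quoted issue (the tii holds only a term and a long offset; the tis stores absolute values at seek points and deltas in between), the lookup idea can be sketched roughly as below. This is not Lucene code and does not reproduce the patch's file format: the class name SparseTermsIndexSketch, the in-memory arrays standing in for the tii/tis files, and the interval parameter are all hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch (not Lucene code): sparse terms index over delta-coded offsets. */
public class SparseTermsIndexSketch {
    private final String[] terms;      // all terms, sorted; stands in for the tis file
    private final long[] stored;       // absolute offset at seek points, delta elsewhere
    private final String[] indexTerms; // stands in for the tii file: term at each seek point
    private final long[] indexPos;     // stands in for the tii file: entry position of each seek point
    private final int interval;        // assumed spacing between seek points

    public SparseTermsIndexSketch(String[] sortedTerms, long[] absoluteOffsets, int interval) {
        this.interval = interval;
        int n = sortedTerms.length;
        terms = sortedTerms.clone();
        stored = new long[n];
        int numIndex = (n + interval - 1) / interval;
        indexTerms = new String[numIndex];
        indexPos = new long[numIndex];
        for (int i = 0; i < n; i++) {
            if (i % interval == 0) {
                // Seek point: store the offset absolutely and record it in the index.
                stored[i] = absoluteOffsets[i];
                indexTerms[i / interval] = sortedTerms[i];
                indexPos[i / interval] = i;
            } else {
                // Between seek points: store only the delta from the previous term.
                stored[i] = absoluteOffsets[i] - absoluteOffsets[i - 1];
            }
        }
    }

    /** Returns the absolute offset for target, or -1 if the term is absent. */
    public long lookup(String target) {
        int k = Arrays.binarySearch(indexTerms, target);
        if (k < 0) {
            k = -k - 2; // last seek point whose term sorts before target
        }
        if (k < 0) {
            return -1; // target sorts before the first term
        }
        int start = (int) indexPos[k];
        int end = Math.min(start + interval, terms.length);
        long offset = 0;
        for (int i = start; i < end; i++) {
            // The first entry is absolute; later entries accumulate deltas.
            offset = (i == start) ? stored[i] : offset + stored[i];
            int cmp = terms[i].compareTo(target);
            if (cmp == 0) {
                return offset;
            }
            if (cmp > 0) {
                break; // passed the place where target would be
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        SparseTermsIndexSketch idx = new SparseTermsIndexSketch(
            new String[] {"apple", "banana", "cherry", "date", "fig"},
            new long[] {0, 10, 25, 40, 100},
            2);
        System.out.println(idx.lookup("date"));
        System.out.println(idx.lookup("coconut"));
    }
}
```

Because each seek point carries an absolute value, a reader can start decoding there without replaying deltas from the beginning of the file, and the in-memory index needs only two parallel arrays, which matches the RAM saving the issue describes.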