[jira] Updated: (LUCENE-1458) Further steps towards flexible indexing

Mark Miller (JIRA) Mon, 05 Oct 2009 21:08:22 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mark Miller updated LUCENE-1458:
--------------------------------

    Attachment: LUCENE-1458.patch

eh - even if you have moved on, if I'm going to put up a patch, might as well 
do it right - here is another:

* removed a boatload of unused imports
* removed DefaultSkipListWriter/Reader - I accidentally put them back in
* removed an unused field or two (not all)
* parameterized LegacySegmentMergeQueue.java
* Fixed the double read I mentioned in previous comment in IndexWriter
* TermRef defines an equals (that throws UOE) and not hashCode - early stuff I 
guess but odd since no class extends it. Added a hashCode that throws UOE 
anyway.
* fixed bug in TermRangeTermsEnum: lowerTermRef = new TermRef(lowerTermText); 
to lowerTermRef = new TermRef(this.lowerTermText);
* Fixed Remote contrib test to work with TermRef for fieldcache parser (since 
you don't include contrib in the tar)
* Missed a StringBuffer to StringBuilder in MultiTermQuery.toString
* had missed removing deprecated IndexReader.open(final Directory directory) 
and deprecated IndexReader.open(final IndexCommit commit)
* Parameterized some stuff in ParallelReader that made sense - what the heck
* added a nocommit or two on unread fields with a comment that made it look 
like they were/will be used
* Looks like SegmentTermPositions.java may have been screwy in last patch - 
ensured it's now a deleted file - same with TermInfosWriter.java
* You left getEnum(IndexReader reader) in the MultiTerm queries, but not in 
PrefixQuery - just checkin'.
* Missed removing listAll from FileSwitchDirectory - gone
* cleaned up some white space nothings in the patch
* I guess TestBackwardsCompatibility.java has been removed from trunk or 
something? kept it here for now.
* looks like I missed merging in a change to 
TestIndexWriter.java#assertNoUnreferencedFiles - done
* double checked my merge work
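
For readers following the TermRef item above, the equals/hashCode change can be pictured roughly as below. This is only a sketch: `TermRefSketch`, the nested `Ref` class and its `bytes` field are hypothetical stand-ins, not the real TermRef from the patch. The point it shows is that once both methods throw UOE, accidental use of the class as a hash key fails fast instead of silently falling back to identity hashing.

```java
import java.util.HashSet;
import java.util.Set;

public class TermRefSketch {
    // Hypothetical stand-in for TermRef; the byte[] field is an assumption.
    public static final class Ref {
        final byte[] bytes;

        public Ref(byte[] bytes) {
            this.bytes = bytes;
        }

        @Override
        public boolean equals(Object other) {
            throw new UnsupportedOperationException();
        }

        // Added to match equals, so the pair is consistent.
        @Override
        public int hashCode() {
            throw new UnsupportedOperationException();
        }
    }

    // Returns true if adding a Ref to a hash-based set throws UOE.
    public static boolean failsFastInHashSet() {
        Set<Ref> set = new HashSet<Ref>();
        try {
            set.add(new Ref(new byte[] {1}));
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsFastInHashSet());
    }
}
```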

core and contrib tests pass
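
A sketch of why the one-word TermRangeTermsEnum change (parameter versus `this.` field) can matter. The message only gives the before and after lines, so the rest is an assumption: here the constructor is assumed to normalize a null lower bound into the field before the ref is built. `RangeEnumSketch` and the `useField` switch are hypothetical and exist only to show both variants side by side.

```java
public class RangeEnumSketch {
    private String lowerTermText;
    // Stands in for the TermRef built from the lower bound.
    private final String lowerRef;

    public RangeEnumSketch(String lowerTermText, boolean useField) {
        this.lowerTermText = lowerTermText;
        // Assumed normalization: an open (null) lower bound becomes "".
        if (this.lowerTermText == null) {
            this.lowerTermText = "";
        }
        // useField == false reads the raw parameter (the old line);
        // useField == true reads the normalized field (the fixed line).
        this.lowerRef = useField ? this.lowerTermText : lowerTermText;
    }

    public String lowerRef() {
        return lowerRef;
    }

    public static void main(String[] args) {
        // Raw parameter: the null bound leaks through.
        System.out.println(new RangeEnumSketch(null, false).lowerRef());
        // Field: the normalized empty string is used.
        System.out.println(new RangeEnumSketch(null, true).lowerRef().isEmpty());
    }
}
```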




> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch, 
> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, 
> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, 
> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback.  All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>     
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
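
The quoted description's point about encoding offsets absolutely at seek points, rather than as deltas, can be illustrated with a small sketch. Everything here is hypothetical (the class name, the `interval` parameter, the plain long[] layout); it is not the tis/tii format itself, only the general idea: deltas keep most entries small, while an absolute value at each seek point lets a reader start decoding there without replaying earlier entries.

```java
import java.util.Arrays;

public class SeekPointEncodingSketch {
    // Encode ascending offsets: absolute at every interval-th entry,
    // delta from the previous entry everywhere else.
    public static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = (i % interval == 0) ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    // Decode entry `target` starting at the nearest preceding seek point,
    // never reading anything before that seek point.
    public static long decodeAt(long[] encoded, int interval, int target) {
        int seekPoint = (target / interval) * interval;
        long value = encoded[seekPoint];
        for (int i = seekPoint + 1; i <= target; i++) {
            value += encoded[i];
        }
        return value;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 175, 240, 300, 310};
        long[] encoded = encode(offsets, 4);
        System.out.println(Arrays.toString(encoded));
        System.out.println(decodeAt(encoded, 4, 5));
    }
}
```

With an interval of 4, entry 4 is stored absolutely, so looking up entry 5 reads only two values instead of six.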


