Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

Michael McCandless Fri, 21 Nov 2008 11:59:21 -0800

We could easily add a sequence ID (ord) today, for a single segment's term dict; but merging them (so that MultiSegmentReader could also present TermEnum.ord()) is problematic.
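
To make the merging problem concrete, here is a minimal sketch in plain Java. It is not Lucene code: SegmentOrdSketch, segmentOrd and mergedOrd are hypothetical names, and each segment is modelled as a sorted String array. Each segment numbers its own terms from 0, so the same term can carry a different ord in every segment, and a reader-wide ord needs the sorted union of all segments' terms.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names are Lucene APIs.
final class SegmentOrdSketch {

    // Ord within one segment: the term's position in that segment's sorted term list.
    static int segmentOrd(String[] sortedTerms, String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1; // -1 if the term is absent from this segment
    }

    // A reader-wide ord requires the sorted union of every segment's terms.
    static int mergedOrd(String[][] segments, String term) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] segment : segments) {
            all.addAll(Arrays.asList(segment));
        }
        if (!all.contains(term)) {
            return -1;
        }
        // Count of distinct terms that sort strictly before the given term.
        return all.headSet(term).size();
    }

    public static void main(String[] args) {
        String[] seg0 = {"apple", "cherry", "fig"};
        String[] seg1 = {"banana", "blueberry", "cherry"};
        System.out.println(segmentOrd(seg0, "cherry"));                     // 1
        System.out.println(segmentOrd(seg1, "cherry"));                     // 2
        System.out.println(mergedOrd(new String[][] {seg0, seg1}, "cherry")); // 3
    }
}
```

In this toy model the merged ord is only available after walking all segments' terms, which is the cost a multi-segment ord() would have to pay or cache somehow.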


Mike

Jason Rutherglen wrote:

It would be nice to have btree-like features such as previous(), min and max. Also a unique sequence id per term that enables faster lookup if the term id is known.
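
As a rough sketch of what these operations could look like over a write-once, sorted term list (plain Java, not a Lucene API; SortedTermsSketch and its methods are hypothetical names, and a sorted array stands in for the btree):

```java
import java.util.Arrays;

// Hypothetical sketch; a sorted array stands in for a write-once btree.
final class SortedTermsSketch {
    private final String[] terms; // sorted, unique, never modified after construction

    SortedTermsSketch(String[] sortedUniqueTerms) {
        this.terms = sortedUniqueTerms.clone();
    }

    String min() {
        return terms.length == 0 ? null : terms[0];
    }

    String max() {
        return terms.length == 0 ? null : terms[terms.length - 1];
    }

    // Sequence id of a term, or -1 if the term is absent.
    int ord(String term) {
        int idx = Arrays.binarySearch(terms, term);
        return idx >= 0 ? idx : -1;
    }

    // Lookup by a known sequence id: a single array access, no string comparisons.
    String termAt(int ord) {
        return terms[ord];
    }

    // Greatest term strictly less than the given one, or null if there is none.
    String previous(String term) {
        int idx = Arrays.binarySearch(terms, term);
        int insertionPoint = idx >= 0 ? idx : -(idx + 1);
        return insertionPoint == 0 ? null : terms[insertionPoint - 1];
    }

    public static void main(String[] args) {
        SortedTermsSketch dict = new SortedTermsSketch(new String[] {"ant", "bee", "cat"});
        System.out.println(dict.previous("bee")); // ant
        System.out.println(dict.termAt(dict.ord("cat"))); // cat
    }
}
```

The sequence id here is simply the position in sorted order, which is only stable within a single write-once segment.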
On Wed, Nov 19, 2008 at 1:38 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
I think we wouldn't do any term compression for the btree, at least for the parts loaded in RAM (we don't today, ie, we create the full Term or String as an array).
For the parts left on disk we should be able to do something similar to what we do today, eg for child nodes only encode the "delta" wrt the parent node?
Mike
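
One possible reading of encoding only the "delta" against the parent node is shared-prefix compression: store how many leading characters the child term shares with the parent term, plus the remaining suffix. The sketch below is plain Java under that assumption; PrefixDeltaSketch is a hypothetical name and the "<prefixLength>|<suffix>" string is only a readable stand-in for a binary on-disk encoding.

```java
// Hypothetical sketch; the real on-disk layout may differ.
final class PrefixDeltaSketch {

    // Number of leading characters shared by the two terms.
    static int sharedPrefix(String parent, String child) {
        int limit = Math.min(parent.length(), child.length());
        int i = 0;
        while (i < limit && parent.charAt(i) == child.charAt(i)) {
            i++;
        }
        return i;
    }

    // Encode the child as "<prefixLength>|<suffix>" relative to the parent term.
    static String encode(String parent, String child) {
        int prefix = sharedPrefix(parent, child);
        return prefix + "|" + child.substring(prefix);
    }

    // Rebuild the child term from the parent term and the encoded delta.
    static String decode(String parent, String encoded) {
        int bar = encoded.indexOf('|'); // first bar always ends the numeric prefix length
        int prefix = Integer.parseInt(encoded.substring(0, bar));
        return parent.substring(0, prefix) + encoded.substring(bar + 1);
    }

    public static void main(String[] args) {
        String delta = encode("lucene", "lucid");
        System.out.println(delta);                   // 3|id
        System.out.println(decode("lucene", delta)); // lucid
    }
}
```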


Jason Rutherglen wrote:
Michael B: Are you interested in making column stride fields realtime and using the btree for the terms? This is an idea I started on, which I called tag index, where the postings are divided into blocks. The blocks can then be replaced in memory, with periodic flushes to disk as the in-RAM postings grow.
Michael M: How would the term compression be handled in a btree model?
On Wed, Nov 19, 2008 at 2:29 AM, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648971#action_12648971 ]
Michael McCandless commented on LUCENE-1458:
--------------------------------------------

bq. So something like a B+Tree would probably work better.
I agree, btree is a better fit, though we don't need insertion & deletion operations since each segment is write once.
> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
> Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format.  This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas.  Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
> .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array).  It should be faster to init too.
> .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.  EG there is no more TermInfo used
>     when reading the new format.
> .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility).  EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.
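
The issue description above says the in-RAM terms dict index becomes an array of offsets plus an array of String. A minimal sketch of how such parallel arrays could resolve a seek point follows. It is plain Java, not the patch's code: TermsIndexSketch and seekOffset are hypothetical names, and the offsets in the example are made up.

```java
import java.util.Arrays;

// Hypothetical sketch of an in-RAM terms index; names and layout are assumptions.
final class TermsIndexSketch {
    private final String[] indexTerms; // sampled terms of one field, in sorted order
    private final long[] offsets;      // offsets[i] = file offset where indexTerms[i] starts

    TermsIndexSketch(String[] indexTerms, long[] offsets) {
        if (indexTerms.length != offsets.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        this.indexTerms = indexTerms;
        this.offsets = offsets;
    }

    // File offset of the greatest index term <= target, from which a reader
    // would scan forward. Returns -1 if target sorts before every index term.
    long seekOffset(String target) {
        int idx = Arrays.binarySearch(indexTerms, target);
        if (idx < 0) {
            idx = -(idx + 1) - 1; // insertion point minus one
        }
        return idx < 0 ? -1L : offsets[idx];
    }

    public static void main(String[] args) {
        TermsIndexSketch index = new TermsIndexSketch(
            new String[] {"a", "m", "t"}, new long[] {0L, 4096L, 9000L});
        System.out.println(index.seekOffset("p")); // 4096
    }
}
```

Because the index holds only a String and a long per entry, no per-term object such as a TermInfo is needed in RAM in this model.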

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



