It would be nice to have btree-like features such as previous(), min and max. A unique sequence id per term would also enable faster lookup when the term id is known.
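
A minimal sketch of what such an API might look like, assuming a write-once segment whose terms are already sorted, so that a plain sorted array can stand in for the btree. The class name `SortedTermDict` and every method on it are made up for illustration; none of this is part of Lucene or of the LUCENE-1458 patch, and the term order here is simply Java's String order.

```java
import java.util.Arrays;

/** Hypothetical illustration only, not a Lucene class. */
final class SortedTermDict {
    // Sorted, unique terms. A term's index in this array serves as its sequence id ("ord").
    private final String[] terms;

    SortedTermDict(String[] sortedUniqueTerms) {
        this.terms = sortedUniqueTerms.clone();
    }

    /** Smallest term, or null if the dictionary is empty. */
    String min() {
        return terms.length == 0 ? null : terms[0];
    }

    /** Largest term, or null if the dictionary is empty. */
    String max() {
        return terms.length == 0 ? null : terms[terms.length - 1];
    }

    /** Sequence id of the term, or -1 if the term is absent (binary search). */
    int ord(String term) {
        int i = Arrays.binarySearch(terms, term);
        return i >= 0 ? i : -1;
    }

    /** Constant-time lookup when the sequence id is already known. */
    String termAt(int ord) {
        return terms[ord];
    }

    /** Largest term strictly smaller than the given term, or null if there is none. */
    String previous(String term) {
        int i = Arrays.binarySearch(terms, term);
        // When the term is absent, binarySearch returns -(insertionPoint) - 1.
        int insertionPoint = i >= 0 ? i : -(i + 1);
        return insertionPoint == 0 ? null : terms[insertionPoint - 1];
    }
}
```

With the sequence id in hand, `termAt(ord)` skips the search entirely, which is the faster lookup mentioned above. A real btree would page nodes from disk instead of holding every term in RAM.
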

On Wed, Nov 19, 2008 at 1:38 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:

>
> I think we wouldn't do any term compression for the btree, at least for the
> parts loaded in RAM (we don't today, ie, we create the full Term or String
> as an array).
>
> For the parts left on disk we should be able to do something similar to
> what we do today, eg for child nodes only encode the "delta" wrt the parent
> node?
>
> Mike
>
>
> Jason Rutherglen wrote:
>
>> Michael B: Are you interested in making column stride fields realtime and
>> use the btree for the terms? This is an idea I started on I called tag
>> index where the postings are divided into blocks. The blocks can then be
>> replaced in memory with periodic flush to disk as the in-RAM postings grow.
>>
>> Michael M: How would the term compression be handled in a btree model?
>>
>> On Wed, Nov 19, 2008 at 2:29 AM, Michael McCandless (JIRA) <
>> [EMAIL PROTECTED]> wrote:
>>
>> [
>> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648971#action_12648971
>> ]
>>
>> Michael McCandless commented on LUCENE-1458:
>> --------------------------------------------
>>
>> bq. So something like a B+Tree would probably work better.
>>
>> I agree, btree is a better fit, though we don't need insertion & deletion
>> operations since each segment is write once.
>>
>> > Further steps towards flexible indexing
>> > ---------------------------------------
>> >
>> > Key: LUCENE-1458
>> > URL: https://issues.apache.org/jira/browse/LUCENE-1458
>> > Project: Lucene - Java
>> > Issue Type: New Feature
>> > Components: Index
>> > Affects Versions: 2.9
>> > Reporter: Michael McCandless
>> > Assignee: Michael McCandless
>> > Priority: Minor
>> > Fix For: 2.9
>> >
>> > Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>> >
>> >
>> > I attached a very rough checkpoint of my current patch, to get early
>> > feedback. All tests pass, though back compat tests don't pass due to
>> > changes to package-private APIs plus certain bugs in tests that
>> > happened to work (eg call TermPositions.nextPosition() too many times,
>> > which the new API asserts against).
>> > [Aside: I think, when we commit changes to package-private APIs such
>> > that back-compat tests don't pass, we could go back, make a branch on
>> > the back-compat tag, commit changes to the tests to use the new
>> > package private APIs on that branch, then fix nightly build to use the
>> > tip of that branch?]
>> > There's still plenty to do before this is committable! This is a
>> > rather large change:
>> >   * Switches to a new more efficient terms dict format. This still
>> >     uses tii/tis files, but the tii only stores term & long offset
>> >     (not a TermInfo). At seek points, tis encodes term & freq/prox
>> >     offsets absolutely instead of with deltas. Also, tis/tii
>> >     are structured by field, so we don't have to record field number
>> >     in every term.
>> > .
>> >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>> >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>> > .
>> >     RAM usage when loading terms dict index is significantly less
>> >     since we only load an array of offsets and an array of String (no
>> >     more TermInfo array). It should be faster to init too.
>> > .
>> >     This part is basically done.
>> >   * Introduces modular reader codec that strongly decouples terms dict
>> >     from docs/positions readers. EG there is no more TermInfo used
>> >     when reading the new format.
>> > .
>> >     There's nice symmetry now between reading & writing in the codec
>> >     chain -- the current docs/prox format is captured in:
>> > {code}
>> > FormatPostingsTermsDictWriter/Reader
>> > FormatPostingsDocsWriter/Reader (.frq file) and
>> > FormatPostingsPositionsWriter/Reader (.prx file).
>> > {code}
>> >     This part is basically done.
>> >   * Introduces a new "flex" API for iterating through the fields,
>> >     terms, docs and positions:
>> > {code}
>> > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>> > {code}
>> >     This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>> >     old API on top of the new API to keep back-compat.
>> >
>> > Next steps:
>> >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>> >     fix any hidden assumptions.
>> >   * Expose new API out of IndexReader, deprecate old API but emulate
>> >     old API on top of new one, switch all core/contrib users to the
>> >     new API.
>> >   * Maybe switch to AttributeSources as the base class for TermsEnum,
>> >     DocsEnum, PostingsEnum -- this would give readers API flexibility
>> >     (not just index-file-format flexibility). EG if someone wanted
>> >     to store payload at the term-doc level instead of
>> >     term-doc-position level, you could just add a new attribute.
>> >   * Test performance & iterate.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]