Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

John Wang Thu, 08 Oct 2009 11:59:35 -0700

Awesome!

Mike, can you let us know what the process is and the timeline?


Thanks

-John

On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless <[email protected]> wrote:

> +1!
>
> Mike
>
> On Thu, Oct 8, 2009 at 2:41 PM, John Wang <[email protected]> wrote:
> > Hi guys:
> >
> >      What are your thoughts about contributing Kamikaze as a Lucene contrib
> > package? We just finished porting Kamikaze to Lucene 2.9. The new 2.9
> > API allows us to do some more code tuning and optimization.
> >
> >      We will be releasing Kamikaze, so it might be a good time to add it to
> > the Lucene contrib package if there is interest.
> >
> > Thanks
> >
> > -John
> >
> > On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <[email protected]> wrote:
> >>
> >> By the way: In the last RC of Lucene 2.9 we added a new method to DocIdSet
> >> called isCacheable(). It is used by e.g. CachingWrapperFilter to determine
> >> whether a DocIdSet is easily cacheable or must be copied to an
> >> OpenBitSetDISI. The default is false, so all custom DocIdSets are copied to
> >> OpenBitSetDISI by CachingWrapperFilter, even if that is not needed. If a
> >> DocIdSet does not do disk IO and has a fast iterator, like e.g. the
> >> FieldCache ones in FieldCacheRangeFilter, it should return true; see
> >> CHANGES.txt. Maybe this should also be added to Kamikaze, which is a really
> >> nice project! Filter DocIdSets especially should pass this method through
> >> to their delegate (see FilteredDocIdSet in Lucene).
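
The delegation described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Lucene or Kamikaze code: `SimpleDocIdSet`, `BitSetDocIdSet`, `FilteringDocIdSet` and `evenOnly` are hypothetical stand-ins invented here, and only the `isCacheable()` name and its default of `false` come from the message above.

```java
import java.util.BitSet;

class CacheableDelegationSketch {
    /** Hypothetical stand-in for a doc id set; only membership and the cacheability hint are modeled. */
    static abstract class SimpleDocIdSet {
        abstract boolean contains(int docId);
        // Mirrors the default described above: not cacheable unless a subclass says so.
        boolean isCacheable() { return false; }
    }

    /** Purely in-memory set (no disk IO), so it reports itself as cacheable. */
    static final class BitSetDocIdSet extends SimpleDocIdSet {
        private final BitSet bits;
        BitSetDocIdSet(BitSet bits) { this.bits = bits; }
        @Override boolean contains(int docId) { return bits.get(docId); }
        @Override boolean isCacheable() { return true; }
    }

    /** Wrapper that narrows another set and forwards isCacheable() to its delegate. */
    static abstract class FilteringDocIdSet extends SimpleDocIdSet {
        private final SimpleDocIdSet inner;
        FilteringDocIdSet(SimpleDocIdSet inner) { this.inner = inner; }
        abstract boolean match(int docId);
        @Override boolean contains(int docId) { return inner.contains(docId) && match(docId); }
        @Override boolean isCacheable() { return inner.isCacheable(); }
    }

    /** Example wrapper (hypothetical): keeps only even doc ids from the wrapped set. */
    static SimpleDocIdSet evenOnly(SimpleDocIdSet inner) {
        return new FilteringDocIdSet(inner) {
            @Override boolean match(int docId) { return docId % 2 == 0; }
        };
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2);
        bits.set(3);
        SimpleDocIdSet evens = evenOnly(new BitSetDocIdSet(bits));
        System.out.println(evens.contains(2) + " " + evens.contains(3) + " " + evens.isCacheable());
    }
}
```

Without the forwarding override, the wrapper would inherit the `false` default and a caching layer would copy it even though the wrapped set is cheap to reuse.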
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: [email protected]
> >>
> >>
> >> > -----Original Message-----
> >> > From: John Wang (JIRA) [mailto:[email protected]]
> >> > Sent: Thursday, September 24, 2009 3:14 PM
> >> > To: [email protected]
> >> > Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible
> >> > indexing
> >> >
> >> >
> >> >     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
> >> >
> >> > John Wang commented on LUCENE-1458:
> >> > -----------------------------------
> >> >
> >> > Just an FYI: Kamikaze was originally started as our sandbox for Lucene
> >> > contributions until 2.4 was ready. (We needed the DocIdSet/Iterator
> >> > abstraction that was migrated from Solr.)
> >> >
> >> > It has three components:
> >> >
> >> > 1) P4Delta
> >> > 2) Logical boolean operations on DocIdSet/Iterators (I created a JIRA
> >> > ticket and a patch for Lucene a while ago with performance numbers. It is
> >> > significantly faster than DisjunctionScorer.)
> >> > 3) An algorithm to determine which DocIdSet implementation to use given
> >> > some parameters, e.g. minid, maxid, id count, etc. It learns and adjusts
> >> > from the application behavior if not all parameters are given.
> >> >
> >> > So please feel free to incorporate anything you see fit or move it to
> >> > contrib.
> >> >
> >> >
> >> > > Further steps towards flexible indexing
> >> > > ---------------------------------------
> >> > >
> >> > >                 Key: LUCENE-1458
> >> > >                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >> > >             Project: Lucene - Java
> >> > >          Issue Type: New Feature
> >> > >          Components: Index
> >> > >    Affects Versions: 2.9
> >> > >            Reporter: Michael McCandless
> >> > >            Assignee: Michael McCandless
> >> > >            Priority: Minor
> >> > >         Attachments: LUCENE-1458-back-compat.patch,
> >> > > LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> >> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >> > > LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >> > > LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> >> > > LUCENE-1458.tar.bz2
> >> > >
> >> > >
> >> > > I attached a very rough checkpoint of my current patch, to get early
> >> > > feedback.  All tests pass, though back compat tests don't pass due to
> >> > > changes to package-private APIs plus certain bugs in tests that
> >> > > happened to work (eg call TermPositions.nextPosition() too many times,
> >> > > which the new API asserts against).
> >> > > [Aside: I think, when we commit changes to package-private APIs such
> >> > > that back-compat tests don't pass, we could go back, make a branch on
> >> > > the back-compat tag, commit changes to the tests to use the new
> >> > > package private APIs on that branch, then fix nightly build to use the
> >> > > tip of that branch?]
> >> > > There's still plenty to do before this is committable! This is a
> >> > > rather large change:
> >> > >   * Switches to a new more efficient terms dict format.  This still
> >> > >     uses tii/tis files, but the tii only stores term & long offset
> >> > >     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >> > >     offsets absolutely instead of with deltas.  Also, tis/tii
> >> > >     are structured by field, so we don't have to record field number
> >> > >     in every term.
> >> > > .
> >> > >     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> >> > >     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> >> > > .
> >> > >     RAM usage when loading terms dict index is significantly less
> >> > >     since we only load an array of offsets and an array of String (no
> >> > >     more TermInfo array).  It should be faster to init too.
> >> > > .
> >> > >     This part is basically done.
> >> > >   * Introduces modular reader codec that strongly decouples terms dict
> >> > >     from docs/positions readers.  EG there is no more TermInfo used
> >> > >     when reading the new format.
> >> > > .
> >> > >     There's nice symmetry now between reading & writing in the codec
> >> > >     chain -- the current docs/prox format is captured in:
> >> > > {code}
> >> > > FormatPostingsTermsDictWriter/Reader
> >> > > FormatPostingsDocsWriter/Reader (.frq file) and
> >> > > FormatPostingsPositionsWriter/Reader (.prx file).
> >> > > {code}
> >> > >     This part is basically done.
> >> > >   * Introduces a new "flex" API for iterating through the fields,
> >> > >     terms, docs and positions:
> >> > > {code}
> >> > > FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> >> > > {code}
> >> > >     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> >> > >     old API on top of the new API to keep back-compat.
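
The fields -> terms -> docs -> positions nesting described in the list above can be pictured with a toy in-memory model. This is only a sketch under assumptions: the patch's actual `FieldProducer`, `TermsEnum`, `DocsEnum` and `PostingsEnum` signatures are not shown in this thread, so the class and methods below (`FlexEnumerationSketch`, `add`, `walk`) are hypothetical and use plain sorted maps rather than Lucene's index files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FlexEnumerationSketch {
    // field -> term -> docId -> positions; a toy stand-in, not Lucene's on-disk format.
    private final TreeMap<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> index =
            new TreeMap<>();

    /** Records one occurrence of a term at a position in a document field. */
    void add(String field, String term, int docId, int position) {
        index.computeIfAbsent(field, f -> new TreeMap<>())
             .computeIfAbsent(term, t -> new TreeMap<>())
             .computeIfAbsent(docId, d -> new ArrayList<>())
             .add(position);
    }

    /** Walks fields, then terms, then docs, then positions; keys are visited in sorted order. */
    List<String> walk() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, TreeMap<String, TreeMap<Integer, List<Integer>>>> field : index.entrySet()) {
            for (Map.Entry<String, TreeMap<Integer, List<Integer>>> term : field.getValue().entrySet()) {
                for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                    for (int pos : doc.getValue()) {
                        out.add(field.getKey() + ":" + term.getKey()
                                + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        FlexEnumerationSketch sketch = new FlexEnumerationSketch();
        sketch.add("title", "lucene", 1, 0);
        sketch.add("body", "index", 0, 3);
        sketch.add("body", "index", 0, 7);
        for (String line : sketch.walk()) {
            System.out.println(line);
        }
    }
}
```

Because terms are grouped under their field, the field name is visited once per field rather than once per term, which loosely mirrors the point above about not recording the field number in every term.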
> >> > >
> >> > > Next steps:
> >> > >   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> >> > >     fix any hidden assumptions.
> >> > >   * Expose new API out of IndexReader, deprecate old API but emulate
> >> > >     old API on top of new one, switch all core/contrib users to the
> >> > >     new API.
> >> > >   * Maybe switch to AttributeSources as the base class for TermsEnum,
> >> > >     DocsEnum, PostingsEnum -- this would give readers API flexibility
> >> > >     (not just index-file-format flexibility).  EG if someone wanted
> >> > >     to store payload at the term-doc level instead of
> >> > >     term-doc-position level, you could just add a new attribute.
> >> > >   * Test performance & iterate.
> >> >
> >> > --
> >> > This message is automatically generated by JIRA.
> >> > -
> >> > You can reply to this email to add a comment to the issue online.
> >> >
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: [email protected]
> >> > For additional commands, e-mail: [email protected]
> >>
> >>
> >
> >
>
