Re: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing

John Wang Thu, 08 Oct 2009 14:06:36 -0700

Sounds good.
Will get the ball rolling in a few days.

Thanks


-John

On Thu, Oct 8, 2009 at 1:09 PM, Mark Miller <[email protected]> wrote:

> Yup - you need one for anything developed outside of Apache.
>
> Michael McCandless wrote:
> > Well, it's the usual process... pull together a big patch, open an issue,
> > etc.
> >
> > Probably because it's a large amount of code (I think?) you'll need to
> > submit a software grant
> > (http://www.apache.org/licenses/software-grant.txt).
> >
> > Mike
> >
> > On Thu, Oct 8, 2009 at 2:58 PM, John Wang <[email protected]> wrote:
> >
> >> Awesome!
> >>
> >> Mike, can you let us know what the process is and the time line?
> >>
> >> Thanks
> >>
> >> -John
> >>
> >> On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless
> >> <[email protected]> wrote:
> >>
> >>> +1!
> >>>
> >>> Mike
> >>>
> >>> On Thu, Oct 8, 2009 at 2:41 PM, John Wang <[email protected]> wrote:
> >>>
> >>>> Hi guys:
> >>>>
> >>>>      What are your thoughts about contributing Kamikaze as a lucene
> >>>> contrib package? We just finished porting kamikaze to lucene 2.9. The
> >>>> new 2.9 api allows us to do some more code tuning and optimization
> >>>> improvements.
> >>>>
> >>>>      We will be releasing kamikaze, so it might be a good time to add it
> >>>> to the lucene contrib package if there is interest.
> >>>>
> >>>> Thanks
> >>>>
> >>>> -John
> >>>>
> >>>> On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <[email protected]> wrote:
> >>>>
> >>>>> By the way: In the last RC of Lucene 2.9 we added a new method to
> >>>>> DocIdSet called isCacheable(). It is used by e.g. CachingWrapperFilter
> >>>>> to determine whether a DocIdSet is easily cacheable or must be copied
> >>>>> to an OpenBitSetDISI. The default is false, so all custom DocIdSets are
> >>>>> copied to OpenBitSetDISI by CachingWrapperFilter, even if not needed.
> >>>>> If a DocIdSet does not do disk IO and has a fast iterator, like e.g.
> >>>>> the FieldCache ones in FieldCacheRangeFilter, it should return true
> >>>>> (see CHANGES.txt). Maybe this should also be added to Kamikaze, which
> >>>>> is a really nice project! Especially filter DocIdSets should pass this
> >>>>> method to their delegate (see FilteredDocIdSet in Lucene).
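
The delegation idea above can be sketched without any Lucene dependency. Everything in this block is a hypothetical stand-in (SketchDocIdSet, InMemoryDocIdSet, EvenOnlyDocIdSet, SketchCache are invented names, not Lucene or Kamikaze classes); it only illustrates a wrapper forwarding isCacheable() to its delegate and a cache that copies only when the answer is false.

```java
import java.util.BitSet;

// Hypothetical stand-in for a doc-id set; NOT Lucene's DocIdSet.
abstract class SketchDocIdSet {
    // Conservative default, as described in the message: not cacheable.
    boolean isCacheable() {
        return false;
    }

    abstract boolean contains(int docId);
}

// In-memory set backed by a BitSet: no disk IO and cheap lookups, so it opts in.
final class InMemoryDocIdSet extends SketchDocIdSet {
    private final BitSet bits;

    InMemoryDocIdSet(BitSet bits) {
        this.bits = (BitSet) bits.clone();
    }

    @Override
    boolean isCacheable() {
        return true;
    }

    @Override
    boolean contains(int docId) {
        return bits.get(docId);
    }
}

// Filtering wrapper (keeps even ids only). It adds no IO of its own, so it
// forwards the cacheability question to the set it wraps.
final class EvenOnlyDocIdSet extends SketchDocIdSet {
    private final SketchDocIdSet delegate;

    EvenOnlyDocIdSet(SketchDocIdSet delegate) {
        this.delegate = delegate;
    }

    @Override
    boolean isCacheable() {
        return delegate.isCacheable();
    }

    @Override
    boolean contains(int docId) {
        return docId % 2 == 0 && delegate.contains(docId);
    }
}

// Hypothetical caching helper: reuse the set when it reports itself as
// cacheable, otherwise materialize a copy into an in-memory bit set.
final class SketchCache {
    static SketchDocIdSet toCacheable(SketchDocIdSet set, int maxDoc) {
        if (set.isCacheable()) {
            return set;
        }
        BitSet copy = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (set.contains(doc)) {
                copy.set(doc);
            }
        }
        return new InMemoryDocIdSet(copy);
    }
}
```

With this shape, a wrapper around an in-memory set is cached as-is, while a wrapper around a set that keeps the default is copied once.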
> >>>>>
> >>>>> -----
> >>>>> Uwe Schindler
> >>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>>>> http://www.thetaphi.de
> >>>>> eMail: [email protected]
> >>>>>
> >>>>>
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: John Wang (JIRA) [mailto:[email protected]]
> >>>>>> Sent: Thursday, September 24, 2009 3:14 PM
> >>>>>> To: [email protected]
> >>>>>> Subject: [jira] Commented: (LUCENE-1458) Further steps towards
> >>>>>> flexible indexing
> >>>>>>
> >>>>>>
> >>>>>>     [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
> >>>>>>
> >>>>>> John Wang commented on LUCENE-1458:
> >>>>>> -----------------------------------
> >>>>>>
> >>>>>> Just a FYI: Kamikaze was originally started as our sandbox for Lucene
> >>>>>> contributions until 2.4 was ready. (We needed the DocIdSet/Iterator
> >>>>>> abstraction that was migrated from Solr.)
> >>>>>>
> >>>>>> It has three components:
> >>>>>>
> >>>>>> 1) P4Delta
> >>>>>> 2) Logical boolean operations on DocIdSet/Iterators (I created a jira
> >>>>>> ticket and a patch for Lucene a while ago with performance numbers. It
> >>>>>> is significantly faster than DisjunctionScorer)
> >>>>>> 3) An algorithm to determine which DocIdSet implementations to use
> >>>>>> given some parameters, e.g. minId, maxId, id count etc. It learns and
> >>>>>> adjusts from the application behavior if not all parameters are given.
> >>>>>>
> >>>>>> So please feel free to incorporate anything you see fit or move it to
> >>>>>> contrib.
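
As a rough illustration of the third component, here is one way an implementation could be picked from minId, maxId and id count. This is an assumed heuristic for illustration only, not Kamikaze's actual algorithm: DocIdSetChooser and its Kind values are hypothetical names, and the rule simply compares the estimated memory of a bit set over the id range with that of a plain int array.

```java
// Hypothetical chooser; NOT Kamikaze's real selection logic.
final class DocIdSetChooser {
    enum Kind { BIT_SET, INT_ARRAY }

    // Picks whichever representation is estimated to use fewer bytes.
    static Kind choose(int minId, int maxId, int idCount) {
        if (minId > maxId || idCount < 0) {
            throw new IllegalArgumentException("invalid id range or count");
        }
        // One bit per id in [minId, maxId], rounded up to whole bytes.
        long bitSetBytes = ((long) maxId - minId + 8) / 8;
        // Four bytes per stored id.
        long intArrayBytes = 4L * idCount;
        return bitSetBytes <= intArrayBytes ? Kind.BIT_SET : Kind.INT_ARRAY;
    }
}
```

A sparse set (a hundred ids spread over a million) ends up as an int array under this rule, while a dense one (500 ids within 0..999) ends up as a bit set.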
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> Further steps towards flexible indexing
> >>>>>>> ---------------------------------------
> >>>>>>>
> >>>>>>>                 Key: LUCENE-1458
> >>>>>>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >>>>>>>             Project: Lucene - Java
> >>>>>>>          Issue Type: New Feature
> >>>>>>>          Components: Index
> >>>>>>>    Affects Versions: 2.9
> >>>>>>>            Reporter: Michael McCandless
> >>>>>>>            Assignee: Michael McCandless
> >>>>>>>            Priority: Minor
> >>>>>>>         Attachments: LUCENE-1458-back-compat.patch,
> >>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> >>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> >>>>>>> LUCENE-1458.tar.bz2
> >>>>>>
> >>>>>>> I attached a very rough checkpoint of my current patch, to get early
> >>>>>>> feedback.  All tests pass, though back compat tests don't pass due to
> >>>>>>> changes to package-private APIs plus certain bugs in tests that
> >>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
> >>>>>>> which the new API asserts against).
> >>>>>>> [Aside: I think, when we commit changes to package-private APIs such
> >>>>>>> that back-compat tests don't pass, we could go back, make a branch on
> >>>>>>> the back-compat tag, commit changes to the tests to use the new
> >>>>>>> package private APIs on that branch, then fix nightly build to use the
> >>>>>>> tip of that branch?]
> >>>>>>> There's still plenty to do before this is committable! This is a
> >>>>>>> rather large change:
> >>>>>>>   * Switches to a new more efficient terms dict format.  This still
> >>>>>>>     uses tii/tis files, but the tii only stores term & long offset
> >>>>>>>     (not a TermInfo).  At seek points, tis encodes term & freq/prox
> >>>>>>>     offsets absolutely instead of with deltas.  Also, tis/tii are
> >>>>>>>     structured by field, so we don't have to record field number
> >>>>>>>     in every term.
> >>>>>>> .
> >>>>>>>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
> >>>>>>>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
> >>>>>>> .
> >>>>>>>     RAM usage when loading terms dict index is significantly less
> >>>>>>>     since we only load an array of offsets and an array of String
> >>>>>>>     (no more TermInfo array).  It should be faster to init too.
> >>>>>>> .
> >>>>>>>     This part is basically done.
> >>>>>>>   * Introduces modular reader codec that strongly decouples terms dict
> >>>>>>>     from docs/positions readers.  EG there is no more TermInfo used
> >>>>>>>     when reading the new format.
> >>>>>>> .
> >>>>>>>     There's nice symmetry now between reading & writing in the codec
> >>>>>>>     chain -- the current docs/prox format is captured in:
> >>>>>>> {code}
> >>>>>>> FormatPostingsTermsDictWriter/Reader
> >>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
> >>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
> >>>>>>> {code}
> >>>>>>>     This part is basically done.
> >>>>>>>   * Introduces a new "flex" API for iterating through the fields,
> >>>>>>>     terms, docs and positions:
> >>>>>>> {code}
> >>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> >>>>>>> {code}
> >>>>>>>     This replaces TermEnum/Docs/Positions.  SegmentReader emulates the
> >>>>>>>     old API on top of the new API to keep back-compat.
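
The nesting of the enumeration chain (fields, then terms, then docs, then positions) can be pictured with a toy in-memory index. This sketch does not use the patch's real interfaces, whose method signatures are not shown in this thread; FlexWalkSketch and the nested-map layout are assumptions made purely to show the four-level walk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical walker over a toy index: field -> term -> docId -> positions.
// It mirrors the four nesting levels only, not the actual flex API.
final class FlexWalkSketch {
    static List<String> walk(Map<String, Map<String, Map<Integer, int[]>>> index) {
        List<String> out = new ArrayList<>();
        // Level 1: fields
        for (Map.Entry<String, Map<String, Map<Integer, int[]>>> field : index.entrySet()) {
            // Level 2: terms within the field
            for (Map.Entry<String, Map<Integer, int[]>> term : field.getValue().entrySet()) {
                // Level 3: docs containing the term
                for (Map.Entry<Integer, int[]> doc : term.getValue().entrySet()) {
                    // Level 4: positions of the term inside the doc
                    for (int pos : doc.getValue()) {
                        out.add(field.getKey() + ":" + term.getKey()
                                + " doc=" + doc.getKey() + " pos=" + pos);
                    }
                }
            }
        }
        return out;
    }
}
```

Passing sorted maps (for example TreeMap) keeps fields, terms and doc ids in order, which is what an index reader would normally provide.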
> >>>>>>>
> >>>>>>> Next steps:
> >>>>>>>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> >>>>>>>     fix any hidden assumptions.
> >>>>>>>   * Expose new API out of IndexReader, deprecate old API but emulate
> >>>>>>>     old API on top of new one, switch all core/contrib users to the
> >>>>>>>     new API.
> >>>>>>>   * Maybe switch to AttributeSources as the base class for TermsEnum,
> >>>>>>>     DocsEnum, PostingsEnum -- this would give readers API flexibility
> >>>>>>>     (not just index-file-format flexibility).  EG if someone wanted
> >>>>>>>     to store payload at the term-doc level instead of
> >>>>>>>     term-doc-position level, you could just add a new attribute.
> >>>>>>>   * Test performance & iterate.
> >>>>>>>
> >>>>>> --
> >>>>>> This message is automatically generated by JIRA.
> >>>>>> -
> >>>>>> You can reply to this email to add a comment to the issue online.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: [email protected]
> >>>>>> For additional commands, e-mail: [email protected]
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
>
>
