Sounds good. Will get the ball rolling in a few days. Thanks
-John

On Thu, Oct 8, 2009 at 1:09 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Yup - you need one for anything developed outside of Apache.
>
> Michael McCandless wrote:
> > Well, it's the usual process... pull together a big patch, open an
> > issue, etc.
> >
> > Probably because it's a large amount of code (I think?) you'll need to
> > submit a software grant
> > (http://www.apache.org/licenses/software-grant.txt).
> >
> > Mike
> >
> > On Thu, Oct 8, 2009 at 2:58 PM, John Wang <john.w...@gmail.com> wrote:
> >
> >> Awesome!
> >>
> >> Mike, can you let us know what the process is and the time line?
> >>
> >> Thanks
> >>
> >> -John
> >>
> >> On Thu, Oct 8, 2009 at 11:48 AM, Michael McCandless
> >> <luc...@mikemccandless.com> wrote:
> >>
> >>> +1!
> >>>
> >>> Mike
> >>>
> >>> On Thu, Oct 8, 2009 at 2:41 PM, John Wang <john.w...@gmail.com> wrote:
> >>>
> >>>> Hi guys:
> >>>>
> >>>> What are your thoughts about contributing Kamikaze as a Lucene
> >>>> contrib package? We just finished porting Kamikaze to Lucene 2.9.
> >>>> The new 2.9 API allows us some more code tuning and optimization
> >>>> improvements.
> >>>>
> >>>> We will be releasing Kamikaze, so it might be a good time to add it
> >>>> to the Lucene contrib package if there is interest.
> >>>>
> >>>> Thanks
> >>>>
> >>>> -John
> >>>>
> >>>> On Thu, Sep 24, 2009 at 6:20 AM, Uwe Schindler <u...@thetaphi.de>
> >>>> wrote:
> >>>>
> >>>>> By the way: In the last RC of Lucene 2.9 we added a new method to
> >>>>> DocIdSet called isCacheable(). It is used by e.g.
> >>>>> CachingWrapperFilter to determine whether a DocIdSet is easily
> >>>>> cacheable or must be copied to an OpenBitSetDISI (the default is
> >>>>> false, so all custom DocIdSets are copied to OpenBitSetDISI by
> >>>>> CachingWrapperFilter, even if not needed - if a DocIdSet does not
> >>>>> do disk IO and has a fast iterator like e.g. the FieldCache ones in
> >>>>> FieldCacheRangeFilter, it should return true; see CHANGES.txt).
> >>>>> Maybe this should also be added to Kamikaze, which is a really nice
> >>>>> project! Especially filter DocIdSets should pass this method to
> >>>>> their delegate (see FilterDocIdSet in Lucene).
> >>>>>
> >>>>> -----
> >>>>> Uwe Schindler
> >>>>> H.-H.-Meier-Allee 63, D-28213 Bremen
> >>>>> http://www.thetaphi.de
> >>>>> eMail: u...@thetaphi.de
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: John Wang (JIRA) [mailto:j...@apache.org]
> >>>>>> Sent: Thursday, September 24, 2009 3:14 PM
> >>>>>> To: java-dev@lucene.apache.org
> >>>>>> Subject: [jira] Commented: (LUCENE-1458) Further steps towards
> >>>>>> flexible indexing
> >>>>>>
> >>>>>> [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12759112#action_12759112 ]
> >>>>>>
> >>>>>> John Wang commented on LUCENE-1458:
> >>>>>> -----------------------------------
> >>>>>>
> >>>>>> Just an FYI: Kamikaze was originally started as our sandbox for
> >>>>>> Lucene contributions until 2.4 was ready (we needed the
> >>>>>> DocIdSet/Iterator abstraction that was migrated from Solr).
> >>>>>>
> >>>>>> It has three components:
> >>>>>>
> >>>>>> 1) P4Delta
> >>>>>> 2) Logical boolean operations on DocIdSet/Iterators (I created a
> >>>>>> JIRA ticket and a patch for Lucene a while ago with performance
> >>>>>> numbers. It is significantly faster than DisjunctionScorer)
> >>>>>> 3) An algorithm to determine which DocIdSet implementations to use
> >>>>>> given some parameters, e.g. minId, maxId, id count, etc. It learns
> >>>>>> and adjusts from the application behavior if not all parameters
> >>>>>> are given.
> >>>>>>
> >>>>>> So please feel free to incorporate anything you see fit, or move
> >>>>>> it to contrib.
> >>>>>>
> >>>>>>> Further steps towards flexible indexing
> >>>>>>> ---------------------------------------
> >>>>>>>
> >>>>>>> Key: LUCENE-1458
> >>>>>>> URL: https://issues.apache.org/jira/browse/LUCENE-1458
> >>>>>>> Project: Lucene - Java
> >>>>>>> Issue Type: New Feature
> >>>>>>> Components: Index
> >>>>>>> Affects Versions: 2.9
> >>>>>>> Reporter: Michael McCandless
> >>>>>>> Assignee: Michael McCandless
> >>>>>>> Priority: Minor
> >>>>>>> Attachments: LUCENE-1458-back-compat.patch,
> >>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
> >>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
> >>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
> >>>>>>> LUCENE-1458.tar.bz2
> >>>>>>>
> >>>>>>> I attached a very rough checkpoint of my current patch, to get
> >>>>>>> early feedback. All tests pass, though back compat tests don't
> >>>>>>> pass due to changes to package-private APIs plus certain bugs in
> >>>>>>> tests that happened to work (eg call
> >>>>>>> TermPositions.nextPosition() too many times, which the new API
> >>>>>>> asserts against).
> >>>>>>> [Aside: I think, when we commit changes to package-private APIs
> >>>>>>> such that back-compat tests don't pass, we could go back, make a
> >>>>>>> branch on the back-compat tag, commit changes to the tests to use
> >>>>>>> the new package private APIs on that branch, then fix nightly
> >>>>>>> build to use the tip of that branch?]
> >>>>>>> There's still plenty to do before this is committable! This is a
> >>>>>>> rather large change:
> >>>>>>> * Switches to a new more efficient terms dict format. This still
> >>>>>>> uses tii/tis files, but the tii only stores term & long offset
> >>>>>>> (not a TermInfo). At seek points, tis encodes term & freq/prox
> >>>>>>> offsets absolutely instead of with deltas. Also, tis/tii
> >>>>>>> are structured by field, so we don't have to record field
> >>>>>>> number in every term.
> >>>>>>> .
> >>>>>>> On first 1 M docs of Wikipedia, tii file is 36% smaller
> >>>>>>> (0.99 MB -> 0.64 MB) and tis file is 9% smaller
> >>>>>>> (75.5 MB -> 68.5 MB).
> >>>>>>> .
> >>>>>>> RAM usage when loading terms dict index is significantly less
> >>>>>>> since we only load an array of offsets and an array of String
> >>>>>>> (no more TermInfo array). It should be faster to init too.
> >>>>>>> .
> >>>>>>> This part is basically done.
> >>>>>>> * Introduces modular reader codec that strongly decouples terms
> >>>>>>> dict from docs/positions readers. EG there is no more TermInfo
> >>>>>>> used when reading the new format.
> >>>>>>> .
> >>>>>>> There's nice symmetry now between reading & writing in the
> >>>>>>> codec chain -- the current docs/prox format is captured in:
> >>>>>>> {code}
> >>>>>>> FormatPostingsTermsDictWriter/Reader
> >>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
> >>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
> >>>>>>> {code}
> >>>>>>> This part is basically done.
> >>>>>>> * Introduces a new "flex" API for iterating through the fields,
> >>>>>>> terms, docs and positions:
> >>>>>>> {code}
> >>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> >>>>>>> {code}
> >>>>>>> This replaces TermEnum/Docs/Positions. SegmentReader emulates
> >>>>>>> the old API on top of the new API to keep back-compat.
> >>>>>>>
> >>>>>>> Next steps:
> >>>>>>> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
> >>>>>>> fix any hidden assumptions.
> >>>>>>> * Expose new API out of IndexReader, deprecate old API but
> >>>>>>> emulate old API on top of new one, switch all core/contrib
> >>>>>>> users to the new API.
> >>>>>>> * Maybe switch to AttributeSources as the base class for
> >>>>>>> TermsEnum, DocsEnum, PostingsEnum -- this would give readers
> >>>>>>> API flexibility (not just index-file-format flexibility). EG
> >>>>>>> if someone wanted to store payload at the term-doc level
> >>>>>>> instead of term-doc-position level, you could just add a new
> >>>>>>> attribute.
> >>>>>>> * Test performance & iterate.
> >>>>>>>
> >>>>>> --
> >>>>>> This message is automatically generated by JIRA.
> >>>>>> -
> >>>>>> You can reply to this email to add a comment to the issue online.
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> >>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
> --
> - Mark
>
> http://www.lucidimagination.com