Whoops, sorry I missed that! Yes, this'll be our first test :)
Mike

On Tue, Oct 13, 2009 at 4:58 PM, Michael Busch <busch...@gmail.com> wrote:
> On 10/13/09 9:43 AM, Michael Busch wrote:
>>
>> Shall we first remove the remaining deprecations from the indexer package?
>> There are not many more left, shouldn't be much work.
>>
>
> I wasn't quick enough for you :) Working on LUCENE-1979 now - that will be
> the first test on how good svn merge is!
>
> Michael
>
>> Michael
>>
>> On 10/13/09 5:47 AM, Michael McCandless wrote:
>>>
>>> OK I will cut a branch & commit Mark's last patch onto it, unless
>>> anyone has objections soonish...
>>>
>>> I'll also branch (twig?) the back compat branch so we can commit the
>>> patch there as well.
>>>
>>> Mike
>>>
>>> On Mon, Oct 12, 2009 at 10:50 PM, Mark Miller <markrmil...@gmail.com>
>>> wrote:
>>>>
>>>> SVN is about as good at merging branches as any of us are with a patch
>>>> and trunk unfortunately. But that can still be somewhat more convenient
>>>> than all these huge patches, with different people at different stages.
>>>>
>>>> Depends on how many people end up working on this though. Any more than
>>>> 2, and I think the branch has got to be worth it.
>>>>
>>>> From my perspective, it doesn't make any of the merging process any
>>>> easier - but it can be easier than juggling all these patches - you have
>>>> a central code base that can always be targeted for current merging.
>>>>
>>>> Michael Busch wrote:
>>>>>
>>>>> I think it's supposed to work pretty good - though I have no personal
>>>>> experience with merging branches with svn.
>>>>>
>>>>> I think we should try it - then we'll know!
>>>>> :)
>>>>>
>>>>> Michael
>>>>>
>>>>> On 10/12/09 12:32 PM, Michael McCandless (JIRA) wrote:
>>>>>>
>>>>>> [
>>>>>> https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764799#action_12764799
>>>>>> ]
>>>>>>
>>>>>> Michael McCandless commented on LUCENE-1458:
>>>>>> --------------------------------------------
>>>>>>
>>>>>> bq. Shall we create a flexible-indexing branch and commit this?
>>>>>>
>>>>>> I think this is a good idea.
>>>>>>
>>>>>> But I haven't played heavily w/ svn & branching. EG if we branch
>>>>>> now, and trunk moves fast (which it still is w/ deprecation
>>>>>> removals), are we going to have conflicts? Or... is svn good about
>>>>>> merging branches?
>>>>>>
>>>>>>
>>>>>>> Further steps towards flexible indexing
>>>>>>> ---------------------------------------
>>>>>>>
>>>>>>> Key: LUCENE-1458
>>>>>>> URL: https://issues.apache.org/jira/browse/LUCENE-1458
>>>>>>> Project: Lucene - Java
>>>>>>> Issue Type: New Feature
>>>>>>> Components: Index
>>>>>>> Affects Versions: 2.9
>>>>>>> Reporter: Michael McCandless
>>>>>>> Assignee: Michael McCandless
>>>>>>> Priority: Minor
>>>>>>> Attachments: LUCENE-1458-back-compat.patch,
>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458-back-compat.patch,
>>>>>>> LUCENE-1458-back-compat.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch,
>>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>> LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2, LUCENE-1458.tar.bz2,
>>>>>>> LUCENE-1458.tar.bz2
>>>>>>>
>>>>>>>
>>>>>>> I attached a very rough checkpoint of my current patch, to get early
>>>>>>> feedback.
>>>>>>> All tests pass, though back compat tests don't pass due to
>>>>>>> changes to package-private APIs plus certain bugs in tests that
>>>>>>> happened to work (eg call TermPositions.nextPosition() too many times,
>>>>>>> which the new API asserts against).
>>>>>>>
>>>>>>> [Aside: I think, when we commit changes to package-private APIs such
>>>>>>> that back-compat tests don't pass, we could go back, make a branch on
>>>>>>> the back-compat tag, commit changes to the tests to use the new
>>>>>>> package private APIs on that branch, then fix nightly build to use the
>>>>>>> tip of that branch?]
>>>>>>>
>>>>>>> There's still plenty to do before this is committable! This is a
>>>>>>> rather large change:
>>>>>>>
>>>>>>> * Switches to a new more efficient terms dict format. This still
>>>>>>>   uses tii/tis files, but the tii only stores term & long offset
>>>>>>>   (not a TermInfo). At seek points, tis encodes term & freq/prox
>>>>>>>   offsets absolutely instead of with deltas. Also, tis/tii
>>>>>>>   are structured by field, so we don't have to record field number
>>>>>>>   in every term.
>>>>>>>   .
>>>>>>>   On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>>>>>>>   -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>>>>>>>   .
>>>>>>>   RAM usage when loading terms dict index is significantly less
>>>>>>>   since we only load an array of offsets and an array of String (no
>>>>>>>   more TermInfo array). It should be faster to init too.
>>>>>>>   .
>>>>>>>   This part is basically done.
>>>>>>> * Introduces modular reader codec that strongly decouples terms dict
>>>>>>>   from docs/positions readers. EG there is no more TermInfo used
>>>>>>>   when reading the new format.
>>>>>>>   .
>>>>>>>   There's nice symmetry now between reading & writing in the codec
>>>>>>>   chain -- the current docs/prox format is captured in:
>>>>>>> {code}
>>>>>>> FormatPostingsTermsDictWriter/Reader
>>>>>>> FormatPostingsDocsWriter/Reader (.frq file) and
>>>>>>> FormatPostingsPositionsWriter/Reader (.prx file).
>>>>>>> {code}
>>>>>>>   This part is basically done.
>>>>>>> * Introduces a new "flex" API for iterating through the fields,
>>>>>>>   terms, docs and positions:
>>>>>>> {code}
>>>>>>> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
>>>>>>> {code}
>>>>>>>   This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>>>>>>>   old API on top of the new API to keep back-compat.
>>>>>>>
>>>>>>> Next steps:
>>>>>>> * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>>>>>>>   fix any hidden assumptions.
>>>>>>> * Expose new API out of IndexReader, deprecate old API but emulate
>>>>>>>   old API on top of new one, switch all core/contrib users to the
>>>>>>>   new API.
>>>>>>> * Maybe switch to AttributeSources as the base class for TermsEnum,
>>>>>>>   DocsEnum, PostingsEnum -- this would give readers API flexibility
>>>>>>>   (not just index-file-format flexibility). EG if someone wanted
>>>>>>>   to store payload at the term-doc level instead of
>>>>>>>   term-doc-position level, you could just add a new attribute.
>>>>>>> * Test performance & iterate.
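
For anyone unfamiliar with the "absolute at seek points, deltas elsewhere" idea from the terms dict bullet above, here is a minimal sketch in Java. It is not Lucene's actual tis/tii format or code: the class name SeekPointEncodingSketch, the encode and seek methods, and the interval of 2 are all made up for illustration. It only tries to show why writing an absolute offset at each seek point lets a reader start there without replaying every earlier delta.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's real terms dict code.
public class SeekPointEncodingSketch {

    /**
     * Encodes sorted file offsets. Every interval-th entry (a "seek point")
     * is stored as an absolute value; all other entries are stored as the
     * delta from the previous offset.
     */
    static long[] encode(long[] offsets, int interval) {
        long[] out = new long[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            boolean seekPoint = (i % interval == 0);
            out[i] = seekPoint ? offsets[i] : offsets[i] - offsets[i - 1];
        }
        return out;
    }

    /**
     * Recovers the offset of entry target by starting at the nearest seek
     * point at or before it and applying only the deltas after that point.
     */
    static long seek(long[] encoded, int interval, int target) {
        int start = (target / interval) * interval; // nearest seek point
        long offset = encoded[start];               // absolute, no history needed
        for (int i = start + 1; i <= target; i++) {
            offset += encoded[i];                   // apply deltas up to target
        }
        return offset;
    }

    public static void main(String[] args) {
        long[] offsets = {100, 130, 175, 260, 300};
        int interval = 2; // assumed value, chosen only to keep the example small
        long[] encoded = encode(offsets, interval);
        System.out.println(Arrays.toString(encoded)); // prints [100, 30, 175, 85, 300]
        System.out.println(seek(encoded, interval, 3)); // prints 260
    }
}
```

With deltas everywhere, reading entry 3 would mean summing from entry 0; with an absolute value at each seek point the reader can start at entry 2 and apply a single delta.
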
>>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>>>>>
>>>>
>>>> --
>>>> - Mark
>>>>
>>>> http://www.lucidimagination.com