Re: Questions about doc store files (.cfx)

Michael McCandless Mon, 09 Nov 2009 09:01:26 -0800

On Mon, Nov 9, 2009 at 10:10 AM, Michael Busch <busch...@gmail.com> wrote:
>> I think you're asking about the benefit of using "shared doc stores" at
>> all?
>>
>> CFX is just the compound format of these shared files; if compound
>> file is off, then they are still shared, just as separate (.fdx/t,
>> .tvx/d/f) files.
>>
>>
>
> Oh yeah, that's true. I do mean the shared doc stores in general.
>
>> For building up a single large index, I suspect the win is
>> sizable, if you store fields and compute term vectors.  You save a lot
>> of IO not merging these files, within that one IndexWriter session.
>>
>> That said, the win is probably less than it used to be, now that we
>> bulk-copy when merging these files.  Previously, without bulk copy, it
>> also consumed a lot of CPU to merge the files.
>>
>> And it's true that the gains only apply within one IW session, so I'd
>> expect this means in practice when building a huge index from scratch
>> you see sizable gains, but then when rolling smallish updates into the
>> index over time, there's no real gain. Though that's something we could
>> [alternatively] pursue improving (eg if we allowed a single segment to
>> reference multiple doc stores).
>>
>>
>
> Ok, thanks for clarifying.
>
>> I do think keeping the IO cost down during merging is important;
>> removing shared doc stores would be a step backwards (though,
>> I agree, would simplify things).
>>
>>
>
> Well, I was just wondering if you or anyone else had any numbers that
> quantify the benefits of the shared stores. If it really helps a lot I agree
> it's a good thing to have them. But they do add a layer of complexity to the
> code (and to the way one has to think about segments), so if the win is
> smallish this might not be desirable.


Alas, I don't have any benchmarks offhand... if you want to run one,
you should be able to hardwire flushDocStores=true in
IndexWriter.doFlushInternal?  I think that'd turn off the sharing
without breaking things (run the tests to be sure ;) ).
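
Short of a real benchmark, a back-of-envelope model gives a rough feel
for the IO at stake.  The sketch below is not Lucene code: the class
and method names are made up, it assumes a plain cascading merge with a
fixed merge factor and equal-sized flushes, and it assumes that with
shared doc stores no stored-field or term-vector bytes get copied by
merges within the session.  It only counts how many doc store bytes
merges would rewrite if every segment had its own private doc store.

```java
// Toy model (hypothetical, not part of Lucene): counts doc store bytes
// rewritten by merges when doc stores are NOT shared across segments.
class DocStoreMergeModel {

    static long bytesCopiedWithoutSharing(int flushedSegments,
                                          int mergeFactor,
                                          long storeBytesPerFlush) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be >= 2");
        }
        // countAtLevel[i] = number of live segments currently at level i
        int[] countAtLevel = new int[64];
        long copied = 0;
        for (int flush = 0; flush < flushedSegments; flush++) {
            int level = 0;
            long segmentBytes = storeBytesPerFlush;
            while (true) {
                int n = countAtLevel[level] + 1;
                if (n < mergeFactor) {
                    // Not enough segments at this level yet: just keep it.
                    countAtLevel[level] = n;
                    break;
                }
                // mergeFactor segments at this level: merge them into one
                // segment at the next level, rewriting all their doc stores.
                countAtLevel[level] = 0;
                segmentBytes *= mergeFactor;
                copied += segmentBytes;
                level++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed workload: 1000 flushes, 1 MB of doc store data each.
        long copied = bytesCopiedWithoutSharing(1000, 10, mb);
        System.out.println("doc store MB rewritten by merges: " + copied / mb);
    }
}
```

In this model each level of merging rewrites the whole doc store once,
so the cost grows with the log (base merge factor) of the number of
flushes.  It says nothing about bulk-copy speed or CPU; only a real run
would show that.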

> Btw: I'm not trying to say it's
> required to remove them for parallel indexing. It'd just be simpler
> without them. You can think about a segmented parallel index as a matrix of
> segments, and about the shared doc stores as merging multiple cells in a
> single row or column of a spreadsheet. It'd be a bit easier if that wasn't
> possible and it always was a true matrix.

I agree, not sharing the stores would make things simpler.  Wouldn't
the parallel indexes be able to "privately" share their own stores?
Ie, how the sharing happens need not be in sync across the main &
parallel indexes?

Mike
