On 11/9/09 2:56 AM, Michael McCandless wrote:
I think you're asking about the benefit of using "shared doc stores" at
all?

CFX is just the compound format of these shared files; if compound
file is off, then they are still shared, just as separate (.fdx/t,
.tvx/d/f) files.

Oh yeah, that's true. I do mean the shared doc stores in general.

For building up a single large index, I suspect the win is
sizable, if you store fields and compute term vectors.  You save a lot
of IO by not merging these files, within that one IndexWriter session.

That said, the win is probably less than it used to be, now that we
bulk-copy when merging these files.  Previously, without bulk copy, it
also consumed a lot of CPU to merge the files.

And it's true that the gains only apply within one IW session, so I'd
expect that in practice you see sizable gains when building a huge
index from scratch, but no real gain when rolling smallish updates into
the index over time. Though that's something we could [alternatively]
pursue improving (e.g., if we allowed a single segment to reference
multiple doc stores).
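
To make the IO argument concrete, here is a rough back-of-envelope sketch. It is a toy model, not Lucene's actual merge policy: it assumes equal-sized flushes and a strictly level-based merge with a fixed merge factor, and the function and parameter names are hypothetical. It counts only the stored-field/term-vector bytes that merges rewrite within one IndexWriter session.

```python
def doc_store_bytes_copied(num_flushes, bytes_per_flush, merge_factor=10, shared=False):
    """Toy model: doc-store bytes rewritten by merges in one writer session.

    Assumption: whenever `merge_factor` segments sit at the same level they
    are merged into one segment at the next level. With shared=True the
    merged segment keeps referencing the existing doc store files, so no
    doc-store bytes are copied.
    """
    levels = []  # levels[i] holds doc-store sizes of the segments at level i
    copied = 0
    for _ in range(num_flushes):
        level = 0
        size = bytes_per_flush
        while True:
            if len(levels) <= level:
                levels.append([])
            levels[level].append(size)
            if len(levels[level]) < merge_factor:
                break
            # Merge every segment at this level into one at the next level.
            merged = sum(levels[level])
            levels[level] = []
            if not shared:
                copied += merged  # doc-store bytes are rewritten by the merge
            size = merged
            level += 1
    return copied
```

Under these assumptions, 1000 flushes with a merge factor of 10 rewrite every doc-store byte three times (once per merge level) when nothing is shared, and not at all when the doc stores are shared. Bulk copy lowers the CPU cost of each rewrite but not the number of bytes moved.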


Ok, thanks for clarifying.

I do think keeping the IO cost down during merging is important;
removing shared doc stores would be a step backward (though,
I agree, it would simplify things).


Well, I was just wondering if you or anyone else had any numbers that
quantify the benefits of the shared doc stores. If they really help a
lot, I agree it's a good thing to have them. But they do add a layer of
complexity to the code (and to the way one has to think about segments),
so if the win is smallish, keeping them might not be desirable.

Btw: I'm not trying to say it's required to remove them for parallel
indexing. It'd just be simpler without them. You can think of a
segmented parallel index as a matrix of segments, and of the shared doc
stores as merging multiple cells in a single row or column of a
spreadsheet. It'd be a bit easier if that wasn't possible and it always
was a true matrix.
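
The matrix picture can be sketched roughly as follows. This is only an illustration of the analogy: the cell coordinates, the doc store labels, and both function names are hypothetical and do not correspond to anything in the Lucene code.

```python
from collections import defaultdict


def shared_doc_stores(cells):
    """Return the doc stores that are referenced by more than one cell.

    `cells` maps (segment_row, parallel_index_col) -> doc store label.
    All labels here are made up for illustration.
    """
    owners = defaultdict(list)
    for cell, store in cells.items():
        owners[store].append(cell)
    return {store: sorted(refs) for store, refs in owners.items() if len(refs) > 1}


def is_true_matrix(cells):
    """A 'true matrix' here means every cell owns its doc store exclusively."""
    return not shared_doc_stores(cells)


# Three segments in one parallel index; the first two share a doc store,
# which corresponds to two merged cells in the spreadsheet analogy.
example = {(0, 0): "store_a", (1, 0): "store_a", (2, 0): "store_b"}
```

With `example`, `shared_doc_stores` reports that `store_a` spans cells (0, 0) and (1, 0), so `is_true_matrix` is false. Giving each cell its own store makes it true again.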

 Michael


Mike

On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch <busch...@gmail.com> wrote:
Hi,

I'm wondering about the benefits of having the .cfx files. The main
advantage is that you avoid merging (copying) stored fields and TermVectors
during segment merge, right? And I think .cfx files are only shared across
segments if the same IndexWriter is used to flush multiple segments and then
to commit all those segments in a single transaction. Then those segments
share the same .cfx file, correct? And in such a case .cfx files are also
not merged into .cfs files?

How big is the win from using .cfx files, usually? I'm wondering, because
the .cfx file is the only one that spans multiple segments and therefore
adds more complexity to the code. For parallel indexing it'd be nice to not
have those kinds of files that belong to multiple segments, especially when
we want to update certain fields.

  Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

