Re: Questions about doc store files (.cfx)

Michael Busch Mon, 09 Nov 2009 07:11:07 -0800

On 11/9/09 2:56 AM, Michael McCandless wrote:

I think you're asking about the benefit of using "shared doc stores" at
all?


CFX is just the compound format of these shared files; if compound
file is off, then they are still shared, just as separate (.fdx/t,
.tvx/d/f) files.

Oh yeah, that's true. I do mean the shared doc stores in general.

For building up a single large index, I suspect the win is
sizable, if you store fields and compute term vectors.  You save a lot
of IO not merging these files, within that one IndexWriter session.

That said, the win is probably less than it used to be, now that we
bulk-copy when merging these files.  Previously, without bulk copy, it
also consumed a lot of CPU to merge the files.

And it's true that the gains only apply within one IW session, so I'd
expect this means in practice when building a huge index from scratch
you see sizable gains, but then when rolling smallish updates into the
index over time, there's no real gain. Though that's something we could
[alternatively] pursue improving (eg if we allowed a single segment to
reference multiple doc stores).
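
To put rough numbers on the IO argument above, here is a back-of-the-envelope
sketch in Python. It is a toy model, not Lucene code. It assumes a perfectly
balanced logarithmic merge (merge_factor ** levels equally sized flushed
segments, all written by one IndexWriter session), that each merge level
rewrites every byte of the files it merges exactly once, and that with shared
doc stores the stored fields and term vectors are never rewritten by those
merges. The function name, its parameters, and the byte sizes are made up for
illustration.

```python
def merge_bytes_copied(levels, merge_factor, postings_bytes, doc_store_bytes,
                       shared_doc_stores):
    """Bytes rewritten by merging in a toy, perfectly balanced log-merge model.

    levels            -- number of merge levels until one segment remains
    merge_factor      -- segments combined per merge (hypothetical setting)
    postings_bytes    -- per flushed segment: everything except the doc stores
    doc_store_bytes   -- per flushed segment: stored fields + term vectors
    shared_doc_stores -- if True, doc store files are assumed never re-copied
    """
    num_segments = merge_factor ** levels
    total_postings = num_segments * postings_bytes
    total_doc_store = num_segments * doc_store_bytes
    # Assumption: every level rewrites each merged byte exactly once.
    per_level = total_postings
    if not shared_doc_stores:
        per_level += total_doc_store
    return levels * per_level


if __name__ == "__main__":
    # Illustrative sizes only: 1 MB of postings and 4 MB of doc stores per flush.
    for shared in (True, False):
        copied = merge_bytes_copied(3, 10, 1_000_000, 4_000_000, shared)
        print("shared doc stores:", shared, "bytes copied by merges:", copied)
```

In this model the saving grows with the share of the index taken up by stored
fields and term vectors. It says nothing about the CPU side, and it does not
cover updates rolled in across several IndexWriter sessions, where the thread
expects no real gain.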


Ok, thanks for clarifying.

I do think keeping the IO cost down during merging is important;
removing shared doc stores would be a step backwards (though,
I agree, would simplify things).

Well, I was just wondering if you or anyone else had any numbers that
quantify the benefits of the shared stores. If it really helps a lot I
agree it's a good thing to have them. But they do add a layer of
complexity to the code (and to the way one has to think about segments),
so if the win is smallish this might not be desirable. Btw: I'm not
trying to say it's required to remove them for parallel indexing. It'd
just be simpler without them. You can think about a segmented
parallel index as a matrix of segments. And about the shared doc stores
as merging multiple cells in a single row or column of a spreadsheet.
It'd be a bit easier if that wasn't possible and it always was a true
matrix.
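
The matrix picture can be sketched in a few lines of Python. This is only an
illustration of the analogy, under assumptions that are not stated in the
thread: rows stand for the parallel indexes, columns stand for segments, and
each cell holds a made-up identifier for the set of files backing that
(index, segment) pair. A shared doc store then shows up as the same identifier
appearing in more than one cell.

```python
from collections import Counter


def shared_cells(grid):
    """Return the file-set ids that are referenced by more than one cell."""
    counts = Counter(cell for row in grid for cell in row)
    return sorted(file_id for file_id, n in counts.items() if n > 1)


def is_true_matrix(grid):
    """True if every cell is backed by its own, unshared set of files."""
    return not shared_cells(grid)


if __name__ == "__main__":
    # Hypothetical layout: two parallel indexes (rows), three segments (columns).
    # "ds0" plays the role of one doc store shared by the first two segments.
    grid = [
        ["a0", "a1", "a2"],
        ["ds0", "ds0", "b2"],
    ]
    print("shared file sets:", shared_cells(grid))
    print("true matrix:", is_true_matrix(grid))
```

Code that updates or merges one cell of a true matrix only has to consider
that cell. Once an identifier is shared, every cell that references it has to
be considered together, which is the extra complexity described above.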


 Michael

Mike

On Mon, Nov 9, 2009 at 3:17 AM, Michael Busch <busch...@gmail.com> wrote:

Hi,

I'm wondering about the benefits of having the .cfx files. The main
advantage is that you avoid merging (copying) stored fields and TermVectors
during segment merge, right? And I think .cfx files are only shared across
segments if the same IndexWriter is used to flush multiple segments and then
to commit all those segments in a single transaction. Then those segments
share the same .cfx file, correct? And in such a case .cfx files are also
not merged into .cfs files?

How big is the win from using .cfx files, typically? I'm wondering, because
the .cfx file is the only one that spans multiple segments and therefore
adds more complexity to the code. For parallel indexing it'd be nice to not
have those kinds of files that belong to multiple segments, especially when
we want to update certain fields.

  Michael

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

