On 10/11/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
David Balmain wrote on 10/10/2006 03:56 PM:
> Actually not using single doc segments was only possible due to the
> fact that I have constant field numbers so both optimizations stem
> from this one change. So I'm not sure if it is worth answering your
> question but I'll try anyway. It obviously depends if you are storing
> the fields and term-vectors. Most Ferret users are indexing data from
> a database and are only storing an id field and no term-vectors so the
> biggest optimization for them is the merge algorithm I'm using for
> term-infos. On the other hand, if you want to highlight the fields
> (Ferret has a very accurate highlighting algorithm that actually uses
> the queries to get the exact terms and phrases matched) then you need
> to store the field with term-vectors. In this case the merging of
> fields and term-vectors is going to be a lot more important.
Hi David,
I use a rich global field model and use term vectors for fast accurate
excerpting in Lucene. Whether or not to store term vectors is the one
index property that is not fixed in my model. The reason is that my
collections tend to contain a mix of many small email messages and a
comparatively small number of much larger documents. Term vectors are a
significant advantage for excerpting large documents, but add no value
and unnecessarily bloat the index for all the small emails. I use a
size threshold to only store term vectors when the body content of the
field exceeds that threshold.
I personally would always store term vectors, since I use a
StandardTokenizer and stemming. In that case, highlighting matches
even in small documents is not trivial. Ferret's highlighter correctly
matches sloppy phrase queries and phrases with gaps between the terms.
I couldn't do this without term vectors.
Would your model in Ferret support that particular field variation? Do
you have an alternative representation to achieve similar benefits? I
suppose it would be possible for the single conceptual field 'body' to
be represented with two physical fields 'smallBody' and 'largeBody'
where the latter stores term vectors and the former does not.
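
To make the split concrete, here is a rough sketch of the routing, following the threshold rule described above (term vectors only when the body exceeds the threshold). It uses no Lucene or Ferret API. The threshold value and the names route_body and build_document are hypothetical and chosen only for illustration.

```python
# Hypothetical threshold in characters; the thread does not give a value.
TERM_VECTOR_THRESHOLD = 2000


def route_body(body, threshold=TERM_VECTOR_THRESHOLD):
    """Pick the physical field for a body and say whether to store term vectors.

    Bodies longer than the threshold go to 'largeBody' with term vectors;
    everything else goes to 'smallBody' without them.
    """
    if len(body) > threshold:
        return ("largeBody", True)
    return ("smallBody", False)


def build_document(doc_id, body):
    """Build a plain dict standing in for an index document (illustrative only)."""
    field_name, with_vectors = route_body(body)
    return {"id": doc_id, field_name: body, "term_vectors": with_vectors}
```

A short email lands in smallBody and a long report in largeBody, so each physical field keeps a fixed term-vector setting.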
Chuck
If I really wanted to solve this problem I would use this solution. It
is pretty easy to search multiple fields when I need to. Ferret's
query language even supports it:
smallBody|largeBody:"phrase to search for"
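
Conceptually, that query matches a document if the phrase occurs in either field. The toy sketch below shows only that either-field semantics on plain dicts. It is not how Ferret evaluates the query, and the helper name phrase_in_any_field and the sample documents are made up for illustration.

```python
def phrase_in_any_field(doc, fields, phrase):
    """Return True if the exact word sequence of `phrase` occurs in any listed field."""
    words = phrase.lower().split()
    for field in fields:
        tokens = doc.get(field, "").lower().split()
        # Slide a window of len(words) tokens across the field text.
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                return True
    return False


# Made-up documents: each has its body in only one of the two physical fields.
docs = [
    {"id": 1, "smallBody": "here is the phrase to search for today"},
    {"id": 2, "largeBody": "a long report containing the phrase to search for somewhere"},
    {"id": 3, "smallBody": "nothing relevant"},
]

hits = [d["id"] for d in docs
        if phrase_in_any_field(d, ["smallBody", "largeBody"], "phrase to search for")]
```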
In the end, I think the benefits of my model far outweigh the costs,
at least for me.
Dave