On 10/11/06, Ning Li <[EMAIL PROTECTED]> wrote:
> On 10/10/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > On 10/10/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > Maybe I missed it, but I was surprised that nobody here wondered
> > > about the algorithm and data structure changes that Dave Balmain
> > > made in Ferret, to make it go faster (than Java Lucene).
> >
> > Not using single doc segments for buffered docs has come up
> >
> > http://www.nabble.com/-jira--Created%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-tf1580652.html#a6177808
>
> After reading the interview article, I thought not using single doc
> segments contributed most of the indexing performance improvement. But
> in the mailing list discussion on "Global field semantics", Dave
> Balmain mentioned most of the indexing performance benefits come from
> having constant field numbers, which greatly optimizes the merging of
> term vectors and stored fields.
>
> Exactly how much performance improvement each of these two
> optimizations provides will depend on the workload. But in general, is
> one playing a more significant role than the other? What about for the
> benchmark workload Yonik pointed out at
> http://rubyforge.org/forum/forum.php?forum_id=9058 ?
>
> Cheers,
> Ning

Actually, not using single doc segments was only possible because I
have constant field numbers, so both optimizations stem from this one
change. So I'm not sure if it is worth answering your question, but
I'll try anyway. It obviously depends on whether you are storing the
fields and term-vectors. Most Ferret users are indexing data from a
database and are only storing an id field and no term-vectors, so the
biggest optimization for them is the merge algorithm I'm using for
term-infos. On the other hand, if you want to highlight the fields
(Ferret has a very accurate highlighting algorithm that actually uses
the queries to get the exact terms and phrases matched), then you need
to store the field with term-vectors. In this case the merging of
fields and term-vectors is going to be a lot more important.

Cheers,
Dave
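
To illustrate why constant field numbers can make merging stored
fields cheaper, here is a minimal sketch in Python. It is not Ferret's
or Lucene's actual code: the Segment class, merge_with_remap and
merge_constant are hypothetical names, and stored documents are
modelled as plain lists of (field_number, value) pairs rather than
on-disk bytes. The assumption is that with per-segment numbering every
stored field has to be decoded and renumbered during a merge, while
with one global numbering the stored documents can be copied as-is.

```python
from typing import Dict, List, Tuple

# Hypothetical model: a stored document is a list of (field_number, value).
StoredDoc = List[Tuple[int, str]]


class Segment:
    """Toy segment: a field-number -> name table plus stored documents."""

    def __init__(self, field_names: List[str], docs: List[StoredDoc]):
        self.field_names = field_names  # index in this list = field number
        self.docs = docs


def merge_with_remap(segments: List[Segment]) -> Segment:
    """Per-segment field numbers: each stored field must be renumbered."""
    merged_names: List[str] = []
    name_to_num: Dict[str, int] = {}
    merged_docs: List[StoredDoc] = []
    for seg in segments:
        # Build the old-number -> new-number table for this segment.
        remap: Dict[int, int] = {}
        for old_num, name in enumerate(seg.field_names):
            if name not in name_to_num:
                name_to_num[name] = len(merged_names)
                merged_names.append(name)
            remap[old_num] = name_to_num[name]
        # Every document is decoded and rewritten field by field.
        for doc in seg.docs:
            merged_docs.append([(remap[num], value) for num, value in doc])
    return Segment(merged_names, merged_docs)


def merge_constant(field_names: List[str], segments: List[Segment]) -> Segment:
    """Global field numbers: documents are appended without rewriting."""
    merged_docs: List[StoredDoc] = []
    for seg in segments:
        merged_docs.extend(seg.docs)  # stands in for a bulk byte copy
    return Segment(field_names, merged_docs)


if __name__ == "__main__":
    # Two segments that number the same fields differently.
    seg_a = Segment(["id", "title"], [[(0, "1"), (1, "Hello")]])
    seg_b = Segment(["title", "id"], [[(0, "World"), (1, "2")]])
    print(merge_with_remap([seg_a, seg_b]).docs)
```

In the second function the per-document work disappears, which is the
kind of saving the constant numbering is meant to allow; how large it
is in practice depends on how many fields and term-vectors are stored.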