Morning all,
I have a reasonably sized index, approx 5 GB and 2 million documents, that I update daily. I use a number of worker threads to create a number of small indexes, which I merge together to get one index of about 100,000 documents and 500 MB in size. I then merge this into the main index, and this is where my problem lies: merging the temp index into the main index not only takes a long time, but causes an excessive amount of disk IO. The final merge results in 30+ GB of data read and 25+ GB of data written, which seems more than a bit excessive.
My code goes along the lines of:
Dim idxs As New System.Collections.Generic.List(Of Lucene.Net.Store.Directory)
1. Start 5 worker threads, each building its own small index from a message queue and adding it to idxs (a rough sketch of one worker is below).
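Each worker does something like the following. This is simplified: DequeueMessage and the "body" field are placeholders for my actual message-queue code, and I've shown the per-worker index as a RAMDirectory, though an FSDirectory would work the same way.

' Simplified sketch of one worker. DequeueMessage and the "body"
' field are placeholders for the real message-queue handling.
Dim wrkIdx As New Lucene.Net.Store.RAMDirectory()
Dim wrkWriter As New Lucene.Net.Index.IndexWriter(wrkIdx, _
    New Lucene.Net.Analysis.Standard.StandardAnalyzer(sWords.ToArray), True)
Dim msg As String = DequeueMessage()
Do While msg IsNot Nothing
    ' one document per queued message
    Dim doc As New Lucene.Net.Documents.Document()
    doc.Add(New Lucene.Net.Documents.Field("body", msg, _
        Lucene.Net.Documents.Field.Store.YES, _
        Lucene.Net.Documents.Field.Index.TOKENIZED))
    wrkWriter.AddDocument(doc)
    msg = DequeueMessage()
Loop
wrkWriter.Close()
' hand the finished worker index over for the merge in step 2
SyncLock idxs
    idxs.Add(wrkIdx)
End SyncLock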
Dim tIndex As Lucene.Net.Index.IndexWriter = Nothing
Dim fIndex As Lucene.Net.Index.IndexWriter = Nothing
Dim tmpIdx As Lucene.Net.Store.Directory
tmpIdx = Lucene.Net.Store.FSDirectory.GetDirectory(System.IO.Path.Combine(Configuration.TempIndexPath, "wrk"), True)
2. When done, merge the 5 worker indexes into 1 temp index (up to this point, disk IO is as I would expect).
tIndex = New Lucene.Net.Index.IndexWriter(tmpIdx, New Lucene.Net.Analysis.Standard.StandardAnalyzer(sWords.ToArray), True)
tIndex.AddIndexes(idxs.ToArray)
tIndex.Close()
3. Merge the temp index into the main index (disk IO goes haywire here).
fIndex = New Lucene.Net.Index.IndexWriter(Configuration.IndexPath, New Lucene.Net.Analysis.Standard.StandardAnalyzer(sWords.ToArray), False)
fIndex.SetMergeFactor(10000)
fIndex.SetUseCompoundFile(True)
fIndex.AddIndexes(New Lucene.Net.Store.Directory() {tmpIdx})
fIndex.Close()
Is there a better way of maintaining or implementing an index of this size (and growing)?
Thanks
David