Hi,

Last Thursday, during the EVO, people were talking about
Lucene/Xapian/Invenio searching - the discussion also touched on where
to keep metadata vs. where to keep fulltext, and whether to keep them
split and let Invenio combine the results from an external search
engine with its internal metadata.

Some of you had understandable worries that it might not be so easy,
but actually it is not so bad -- I gave it a try today and I have a
working system where Lucene reindexes everything as soon as it is
updated. It took about 12 hours to finish and I hit a few issues I
didn't expect, but I still believe it qualifies as an answer that is
easy and powerful -- any field can be indexed with any combination of
tokenizers. I will describe the details later. The code is here:


https://svnweb.cern.ch/trac/rcarepo/browser/sandbox/index-bibrecs/src/lucene_updater
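
To illustrate the per-field tokenizer part -- this is only a minimal
sketch, not the code from the repository above, and the field names
"recid", "author" and "fulltext" are invented for the example -- with
Lucene 2.9 it looks roughly like this:

import java.io.File;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerFieldExample {
    public static void main(String[] args) throws Exception {
        // Default analyzer, plus a different tokenizer per field.
        PerFieldAnalyzerWrapper analyzers =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
        analyzers.addAnalyzer("recid", new KeywordAnalyzer());      // exact match, no tokenizing
        analyzers.addAnalyzer("author", new WhitespaceAnalyzer());  // keep initials and dots
        // "fulltext", "title", etc. fall through to StandardAnalyzer.

        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/invenio-lucene")),
            analyzers, IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        doc.add(new Field("recid", "12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("author", "Ellis, J.", Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("fulltext", "full text extracted from the record ...",
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.close();
    }
}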


So, incrementally reindexing the data in a quite safe way is
DEFINITELY FEASIBLE and quite probably also a very interesting
solution (and why not have two? both are there).
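
The "safe" part is essentially that Lucene's updateDocument() is a
delete-and-add keyed on a term, so a record can be reindexed whenever
it changes without ever being duplicated. Again just a sketch with
invented names; the real updater in the repository above may well do
this differently:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IncrementalUpdateExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/invenio-lucene")),
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        // Pretend record 12345 was just modified in Invenio:
        Document doc = new Document();
        doc.add(new Field("recid", "12345", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("title", "Updated title of the record",
                          Field.Store.YES, Field.Index.ANALYZED));

        // Deletes any previous version with the same recid and adds the new
        // one in a single operation, so the index never holds two copies.
        writer.updateDocument(new Term("recid", "12345"), doc);

        writer.commit();   // make the change visible to searchers
        writer.close();
    }
}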

In the process, I also found this mail that confirms my beliefs:
http://web.archiveorange.com/archive/v/cTKYIIPMGx7ltCSZYrpf

---
The average time for an update when we commit every 10k docs is around
17ms (the IndexWriter buffer is 100MB). I profiled the application for
several hours and I noticed that most of the time is spent in
IndexWriter.applyDeletes()->TermDocs.seek(). I changed the
BufferedDeletes.terms from HashMap to TreeMap to have the terms
ordered and to reduce the number of random seeks on the disk.

I run my tests again with the patched Lucene 2.9.1 and the time has
dropped from 17ms to 2ms. The index has 18GB and 70 million docs.
---
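
For reference, the setup measured in that mail (100 MB IndexWriter
buffer, a commit every 10k documents) would look roughly like the
sketch below; the loop just generates dummy documents, whereas in
reality they would come from the changed Invenio records:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BatchedCommitExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/tmp/invenio-lucene")),
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(100);          // the 100 MB buffer from the quote

        for (int i = 1; i <= 70000; i++) {       // stand-in for the real record stream
            Document doc = new Document();
            doc.add(new Field("recid", Integer.toString(i),
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", "record number " + i,
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.updateDocument(new Term("recid", Integer.toString(i)), doc);

            if (i % 10000 == 0) {
                // The buffered deletes get applied around flush/commit time --
                // this is the applyDeletes() step the quoted patch speeds up.
                writer.commit();
            }
        }
        writer.commit();
        writer.close();
    }
}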

I will describe the details later, as I would only be talking nonsense
now anyway, but you would be surprised how fast my Invenio feels now!

roman
