Hi,

Last Thursday, during the EVO meeting, people were talking about Lucene/Xapian/Invenio searching, and part of the discussion was about where to keep metadata vs. where to keep fulltext -- whether to split them and let Invenio combine results from an external search engine with its own internal metadata search.
Some of you had understandable worries that it might not be so easy, but actually it is not so bad -- I gave it a try today and I have a working system where Lucene reindexes every record as soon as it is updated. The full run took about 12 hours to finish and I hit a few issues I didn't expect, but I still believe it qualifies as an answer that is both easy and powerful -- any field can be indexed with any combination of tokenizers (see the sketch at the end of this mail).

The code is here: https://svnweb.cern.ch/trac/rcarepo/browser/sandbox/index-bibrecs/src/lucene_updater

So, incrementally reindexing the data in a reasonably safe way is DEFINITELY FEASIBLE, and quite probably also a very interesting solution (why not have two engines? both are there).

In the process I also found this mail, which confirms my beliefs: http://web.archiveorange.com/archive/v/cTKYIIPMGx7ltCSZYrpf

---
The average time for an update when we commit every 10k docs is around 17ms (the IndexWriter buffer is 100MB). I profiled the application for several hours and I noticed that most of the time is spent in IndexWriter.applyDeletes()->TermDocs.seek(). I changed the BufferedDeletes.terms from HashMap to TreeMap to have the terms ordered and to reduce the number of random seeks on the disk. I run my tests again with the patched Lucene 2.9.1 and the time has dropped from 17ms to 2ms. The index has 18GB and 70 million docs.
---

I will describe the details later, as I would only talk nonsense now anyway, but you would be surprised how fast my Invenio feels now!

roman
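
P.S. To give a flavour of the approach, here is a rough, untested sketch. It is NOT the code from the sandbox repo above: the field names, the "recid" id field and the fetch* helpers are made up for illustration, and it assumes the Lucene 2.9-era API mentioned in the quoted mail. It only shows the three ingredients: a different analyzer per field, updateDocument() keyed on the record id so a re-run just replaces records, and a commit every 10k docs with a 100MB RAM buffer like in the quote above.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LuceneUpdaterSketch {

    public static void main(String[] args) throws IOException {
        // A different tokenizer per field: title/fulltext get the standard
        // analyzer, report numbers are kept as single tokens.
        PerFieldAnalyzerWrapper analyzers =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_29));
        analyzers.addAnalyzer("reportnumber", new KeywordAnalyzer());

        IndexWriter writer = new IndexWriter(
            FSDirectory.open(new File("/opt/invenio/lucene-index")),
            analyzers, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.setRAMBufferSizeMB(100);  // same buffer size as in the quoted mail

        int count = 0;
        for (int recid : fetchModifiedRecids()) {
            Document doc = new Document();
            doc.add(new Field("recid", String.valueOf(recid),
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", fetchTitle(recid),
                              Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("reportnumber", fetchReportNumber(recid),
                              Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("fulltext", fetchFulltext(recid),
                              Field.Store.NO, Field.Index.ANALYZED));

            // updateDocument = buffered delete-by-term + add, so a re-run
            // simply replaces a record that is already in the index.
            writer.updateDocument(new Term("recid", String.valueOf(recid)), doc);

            // Commit every 10k docs, as in the quoted benchmark.
            if (++count % 10000 == 0) {
                writer.commit();
            }
        }
        writer.commit();
        writer.close();
    }

    // Stubs standing in for the real record access layer (bibrec & co.).
    private static List<Integer> fetchModifiedRecids() { return new ArrayList<Integer>(); }
    private static String fetchTitle(int recid)        { return ""; }
    private static String fetchReportNumber(int recid) { return ""; }
    private static String fetchFulltext(int recid)     { return ""; }
}

The commit frequency and the buffer size are of course tunable; the point is only that the delete-by-term that updateDocument() buffers is what IndexWriter.applyDeletes() later processes, i.e. the part the quoted patch speeds up.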
