Hi all, I am sending some data for your contemplation. Indexing of the whole inspiretest went very smoothly:

746421 records
total indexing time 00:20:27
on average: 608 records/s
36470 recs/min
or 10K recs in 18s

Please see indexing-time.png for a plot of time spent for indexing 10K recs.

Index size: 1.1 GB (not optimized -- therefore bigger)

Distribution of terms per field: please see attached pictures (field-dist-x.png)

Parameters: '-Xms32m,-Xmx900m'
Commit after: 10000 recs
Machine: 5 Intel(R) Xeon(R) CPU L5410 @ 2.33GHz (in python only one thread, no idea how many threads inside JVM)
Python 2.4
Java 1.6.18

Roman

On Mon, Aug 16, 2010 at 7:00 PM, Brooks, Travis C. <[email protected]> wrote:
>
> On Aug 16, 2010, at 6:37 AM, Tibor Simko wrote:
>
>> On Mon, 16 Aug 2010, Roman Chyla wrote:
>>> because index is created by a writer which has instant access to the
>>> invenio, and because values are only indexed (not stored - it is the
>>> invenio, that takes care of that from database), those issues are
>>> eliminated. The lucene is there as a search provider, returning docids
>>> only
>>
>> For the benefit of everyone not present today, we also discussed that
>> Solr might not need data objects stored for its faceting either, which
>> gives us this nice possibility of (1) Invenio storing data objects and
>> (2) Invenio/Lucene/Solr storing indexes. This would make the potential
>> co-existence between the tools much easier. (And Invenio modules that
>> may have to rely on instantaneous there/not-there response, such as
>> parts of WebSubmit, could still use MySQL tables.)
>
> This is very encouraging! Thanks Roman for demonstrating the
> proof-of-concept. This separation makes some bit of sense to me, and after
> all doesn't sound so very different than running a bibindex daemon, no?
>
>>
>>> it was agreed we will do some tests when we find the box with data
>>> (all inspire records), some time this week
>>
>> The freeest one so far seems to be INSPIRETEST, so maybe we shall borrow
>> its cycles for a throw-away testing of Lucene re-indexing the whole
>> INSPIRE later this week.
>
> Yes, INSPIRE-Test seems perfect for this, note:
>
> 1) Tibor saved aside the work that was done there in some long ago test.
> 2) The current data there is not the correct MARC etc. It should be updated
> from dev or prod...
> 3)...but this is still subject to the max_allowed_packet issue which I am
> still trying to get around, to get a clean dump. I may have one from dev or
> prod soon, in which case I will roll it out on test db unless I hear
> otherwise from Roman.
>
>
> Travis
>
>>
>> Best regards
>> --
>> Tibor Simko
>
> Travis C. Brooks
> Manager of Information Systems & SPIRES/INSPIRE
> SLAC National Accelerator Laboratory Library
> http://www.slac.stanford.edu/spires/
>
<<attachment: field-dist-1.png>>
<<attachment: field-dist-2.png>>
<<attachment: indexing-time.png>>
