Hi all,

Here are some numbers for you to consider. Indexing the whole of
INSPIRETEST went very smoothly:


746421 records
total indexing time 00:20:27

on average:
608 records/s
36470 recs/min
or 10K recs in 18s
Please see indexing-time.png for a plot of the time spent indexing 10K recs.
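
For anyone who wants to redo the arithmetic, here is a minimal sketch that
recomputes the averages from the two totals above. The helper name
`throughput` is hypothetical and not part of the indexing code; the quoted
figures were presumably rounded or read off the plot, so small differences
are to be expected.

```python
def throughput(records, hours, minutes, seconds):
    """Return (records per second, records per minute, seconds per 10K records).

    Hypothetical helper for illustration only; it just divides the record
    count by the total wall-clock indexing time.
    """
    total_seconds = hours * 3600 + minutes * 60 + seconds
    per_second = records / float(total_seconds)  # float() keeps Python 2 happy
    return per_second, per_second * 60, 10000 / per_second


# Totals reported above: 746421 records in 00:20:27
per_s, per_min, secs_per_10k = throughput(746421, 0, 20, 27)
print("%.0f records/s, %.0f recs/min, %.1f s per 10K recs"
      % (per_s, per_min, secs_per_10k))
```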

Index size: 1.1 GB (not optimized, and therefore bigger)
Distribution of terms per field: please see attached pictures (field-dist-x.png)

Parameters: '-Xms32m,-Xmx900m'
Commit after: 10000 recs
Machine: 5 Intel(R) Xeon(R) CPU L5410  @ 2.33GHz (only one thread on the
Python side; I do not know how many threads run inside the JVM)
Python 2.4
Java 1.6.18

Roman



On Mon, Aug 16, 2010 at 7:00 PM, Brooks, Travis C.
<[email protected]> wrote:
>
> On Aug 16, 2010, at 6:37 AM, Tibor Simko wrote:
>
>> On Mon, 16 Aug 2010, Roman Chyla wrote:
>>> because the index is created by a writer which has instant access to
>>> Invenio, and because values are only indexed (not stored; Invenio
>>> takes care of that from its database), those issues are
>>> eliminated. Lucene is there as a search provider, returning docids
>>> only
>>
>> For the benefit of everyone not present today, we also discussed that
>> Solr might not need data objects stored for its faceting either, which
>> gives us this nice possibility of (1) Invenio storing data objects and
>> (2) Invenio/Lucene/Solr storing indexes.  This would make the potential
>> co-existence between the tools much easier.  (And Invenio modules that
>> may have to rely on instantaneous there/not-there response, such as
>> parts of WebSubmit, could still use MySQL tables.)
>
> This is very encouraging!   Thanks Roman for demonstrating the 
> proof-of-concept.   This separation makes some bit of sense to me, and after 
> all doesn't sound so very different than running a bibindex daemon, no?
>
>>
>>> it was agreed we will do some tests when we find the box with data
>>> (all inspire records), some time this week
>>
>> The freest one so far seems to be INSPIRETEST, so maybe we shall borrow
>> its cycles for a throw-away test of Lucene re-indexing the whole of
>> INSPIRE later this week.
>
> Yes, INSPIRE-Test seems perfect for this. Note:
>
> 1) Tibor set aside the work that was done there in a test long ago.
> 2) The current data there is not the correct MARC etc.  It should be updated 
> from dev or prod...
> 3)...but this is still subject to the max_allowed_packet issue which I am 
> still trying to get around, to get a clean dump.  I may have one from dev or 
> prod soon, in which case I will roll it out on test db unless I hear 
> otherwise from Roman.
>
>
> Travis
>
>
>
>
>
>>
>> Best regards
>> --
>> Tibor Simko
>
> Travis C. Brooks
> Manager of Information Systems & SPIRES/INSPIRE
> SLAC National Accelerator Laboratory Library
> http://www.slac.stanford.edu/spires/
>
>
>
>
>

<<attachment: field-dist-1.png>>

<<attachment: field-dist-2.png>>

<<attachment: indexing-time.png>>
