Tibor Simko wrote:

On Tue, 28 Feb 2012, Schindler, Sebastian wrote:


We have already uploaded and indexed ~775,000 records. To achieve this, we
had to change the data type of all bibxxx, bibrec_bibxxx and all
PAIR/WORD/PHRASE tables from MEDIUMINT to INT, because some id-overflow
issues occurred at record no. ~600,000.



This can indeed happen, depending on how many index terms your
word-breaking procedures generate.  E.g. for some instances of Invenio that
we are running here, we have 1M+ records, and UNSIGNED MEDIUMINT is
still OK for us.  IIRC, UNSIGNED MEDIUMINT should allow for 16,777,215
index terms.  It seems you are generating more than 16M index terms with
600K records?  That sounds like a lot.  Maybe you don't use stemming?  Or
do you need such fine-tuned word breaking?  You can count the index terms
you generate via commands like:

 $ echo "SELECT COUNT(*) FROM idxWORD01F" | /opt/invenio/bin/dbexec
 $ echo "SELECT MAX(id) FROM idxWORD01F" | /opt/invenio/bin/dbexec
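As a side note, the MEDIUMINT ceiling quoted above follows directly from the column's storage width; a minimal Python sketch (my own illustration, not part of Invenio) to compute the unsigned limits:

```python
# Maximum value an unsigned MySQL integer column can hold,
# derived from its storage width in bytes.
def mysql_unsigned_max(width_bytes):
    return 2 ** (8 * width_bytes) - 1

# MEDIUMINT is stored in 3 bytes, INT in 4 bytes.
print(mysql_unsigned_max(3))  # 16777215
print(mysql_unsigned_max(4))  # 4294967295
```

So moving from UNSIGNED MEDIUMINT to UNSIGNED INT raises the id ceiling from ~16.7M to ~4.29 billion.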



Bibindex nearly freezes/runs very slowly when trying to index the
global index.



It could be just slow due to index size.  How big are your idxWORD*
tables, both index tables (MYI) and data tables (MYD)?  Also, have you
tried to optimise your MySQL server parameters such as key_buffer and
friends?
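For reference, these MyISAM-related parameters live in my.cnf; the values below are purely illustrative placeholders (my own sketch, not recommendations from this thread):

```
[mysqld]
# Cache for MyISAM index blocks (the MYI files); usually the most
# important parameter for index-heavy workloads like bibindex.
key_buffer_size = 1G
# Buffers used by REPAIR TABLE / index rebuild sorts.
myisam_sort_buffer_size = 256M
sort_buffer_size = 8M
read_buffer_size = 2M
```

After changing values, restart the server and watch the Key_reads / Key_read_requests ratio via `SHOW GLOBAL STATUS' to judge whether the key buffer is large enough.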



- bibindex -w global --repair => success, but the problem is still
there



Does `bibindex -u admin -w global -k' report success?



- different flush sizes (5,000, 25,000, 50,000)



The more the better, depending on your RAM size and on the size of your
bibindex process while it is running.

For some indexes that don't generate many index terms, e.g. title, you
could go as high as `-f 260000', if RAM permits, so that re-indexing
all your titles would take only 3 flushes.
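As a back-of-the-envelope check of that flush count (assuming ~776,582 records, the upper id of the range mentioned later in this thread):

```python
import math

records = 776582     # approximate total number of records
flush_size = 260000  # value passed via `-f 260000'

# Each flush writes flush_size records' worth of terms to the DB,
# so the number of flushes is the ceiling of the quotient.
print(math.ceil(records / flush_size))  # 3
```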



The last package (bibindex -w global -f 50000 --id=762635-776582)
threw this exception (invenio.err):

#################################################
Error when putting the term '1846 illustr' into db
(hitlist=intbitset([767550])): (1062, "Duplicate entry '0' for key
'PRIMARY'")



OK, so the problem seems to be with record 767550 and with the word pair
index (idxPAIR*).  So you can do the above size estimate on this index.

To investigate further, let's first see whether the trouble isn't due
to the indexing part that breaks words into pairs.  If your tables are
clean, can you run:

 $ bibindex -w global -a -i 767550

and see if the error is reproducible?  Can you send the MARC representation
of this record so that we can see what characters it contains?

Best regards


Thanks for your reply!


$ echo "SELECT COUNT(*) FROM idxPAIR01F" | /opt/invenio/bin/dbexec
=> 16,406,912



$ echo "SELECT MAX(id) FROM idxPAIR01F" | /opt/invenio/bin/dbexec
=> 16,406,911
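A hedged observation of my own, not confirmed anywhere in this thread: COUNT(*) here exceeds MAX(id) by exactly one, which is what you would see if a row with id 0 exists in the table — the same id that appears in the "Duplicate entry '0' for key 'PRIMARY'" error above. The arithmetic:

```python
count_rows = 16406912  # SELECT COUNT(*) FROM idxPAIR01F
max_id = 16406911      # SELECT MAX(id) FROM idxPAIR01F

# With AUTO_INCREMENT ids normally starting at 1 and no gaps,
# COUNT(*) equals MAX(id); being one higher is consistent with
# an extra row sitting at id 0.
print(count_rows - max_id)  # 1
```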


We changed UNSIGNED MEDIUMINT to UNSIGNED INTEGER because we reached the
size limit in table 'bib99x'
(max(id) = 23,226,563, count(*) = 16,799,001).
MySQL apparently had performance problems comparing MEDIUMINT and INTEGER, so we
made that data type change on most of the tables.
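The bib99x numbers above indeed put max(id) past the MEDIUMINT ceiling; a quick check (values copied from this message):

```python
MEDIUMINT_UNSIGNED_MAX = 2**24 - 1  # 16777215
bib99x_max_id = 23226563            # max(id) reported for bib99x

# max(id) exceeds what UNSIGNED MEDIUMINT can store, so the switch
# to UNSIGNED INTEGER was necessary for this table.
print(bib99x_max_id > MEDIUMINT_UNSIGNED_MAX)  # True
```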



It seems you are generating more than 16M index terms with
600K records?  That sounds like a lot.  Maybe you don't use stemming?  Or
do you need such fine-tuned word breaking?

We do use stemming. Please correct me if I'm wrong, but I guess this number of
terms per record is generated simply because we are uploading relatively large
records (one record in MARC-XML is about 20KB).

How big are your idxWORD*
tables, both index tables (MYI) and data tables (MYD)?

idxPAIR01F:
   1.92G data size
   421M index size

idxPAIR01R:
   868M data size
   9M index size



Also, have you
tried to optimise your MySQL server parameters such as key_buffer and
friends?




Yes, we tried to increase all those buffer parameters of our MySQL instance.
The MySQL server only uses 30% of its available key_buffer when
indexing/flushing.





Does `bibindex -u admin -w global -k' report success?

Well, it seems there is still something corrupted after running the last bibindex jobs as packages:

2012-02-29 13:45:38 --> Task #2437 started.
2012-02-29 13:45:38 --> idxWORD01F has stemming enabled, language en
2012-02-29 13:45:39 --> idxWORD01F contains 11267529 words from 776530 records
2012-02-29 13:45:39 --> EMERGENCY: idxWORD01F needs to repair 1000 of 776530 index records
2012-02-29 13:45:39 --> idxPAIR01F contains 16406912 words from 726583 records
2012-02-29 13:45:40 --> EMERGENCY: idxPAIR01F needs to repair 13948 of 726583 index records
2012-02-29 13:45:40 --> idxPHRASE01F contains 8689545 words from 712635 records
2012-02-29 13:45:40 --> idxPHRASE01F is in consistent state
2012-02-29 13:45:40 --> Task #2437 finished. [DONE]

I just started a bibindex -u admin -w global --repair task in order to fix it.
That repair task seems to be very slow, too. Its progress has been
"idxPAIR01F flushed 0/745112 words" for about 50 minutes now. The 1000
inconsistencies in idxWORD01F have been repaired successfully.

OK, so the problem seems to be with record 767550 and with the word pair
index (idxPAIR*).  So you can do the above size estimate on this index.

I forgot to mention that not only one record threw that error. We received
25,000 of those errors, each related to a different record. That's why I believe
it is an index/database-level error. I can't imagine the next 25,000 records all
being corrupted...

$ bibindex -w global -a -i 767550

I will try that right after bibindex --repair has finished! (IF it finishes in
"finite time" ...)

Kind regards,
Sebastian Schindler

-------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the District Court of Dueren, no. HR B 3498
Chairman of the Supervisory Board: MinDir Dr. Karl Eugen Huthmacher
Board of Management: Prof. Dr. Achim Bachem (Chairman),
Karsten Beneke (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-------------------------------------------------------------------------------

Have you heard about our app? http://www.fz-juelich.de/app
