On 29.02.2012 11:37, Tibor Simko wrote:
> Hi!
> On Tue, 28 Feb 2012, Schindler, Sebastian wrote:
>> We already uploaded and indexed ~775,000 records. To achieve this, we
>> had to change the data type of all bibxxx, bibrec_bibxxx and all
>> PAIR/WORD/PHRASE tables from MEDIUMINT to INT, because some
>> id-overflow issues occurred at record no. ~600,000.
> This may indeed occur, depending on how many index terms your word
> breaking procedures generate. E.g. for some instances of Invenio that
> we are running here, we have 1M+ records, and UNSIGNED MEDIUMINT is
> still OK for us. IIRC, UNSIGNED MEDIUMINT should allow for 16,777,215
> index terms. It seems you are generating more than 16M index terms
> from 600K records? That sounds like a lot. Maybe you are not using
> stemming?
Probably, before investigating internal details, we should check which
words generate this index and why we get that many. I'm not really
involved in the internals of our project here, but it might very well be
that we could leave out a lot of the actual "words" without any loss of
information and thus shrink the index considerably. AFAIK the data in
question is still based on some XSL I wrote a while ago for ingest
conversion, and the idea of my attempt back then was "what can I get out
of a single record, producing the most complete output in the spirit of
the full MARC philosophy". This approach might not be optimal for the
use case.
So the question this discussion raises for me is: "How can I check what
the index in question looks like and what its contents are, so that I
can judge whether we really need this type of indexing?" Probably it's
just a librarian's problem and not a computer issue. (You know, we
librarians are from the "all you can get" party ;)
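One way to peek at it directly, assuming again the stock table names and
that index 1 is the global one:

$ mysql invenio -e "SELECT COUNT(*) FROM idxWORD01F"
$ mysql invenio -e "SELECT term FROM idxWORD01F ORDER BY LENGTH(term) DESC LIMIT 20"

The first number says how many distinct terms we actually carry; the
second usually surfaces the oddities (glued-together author lists, whole
titles) that blow an index up.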
@Sebastian: could it be that the headache is caused by 999 C5? I'd guess
that it is. I think a huge number of words is generated here by the $m
subfield: it just lists _all_ authors. However, the only really relevant
entry in this category is $r; the rest is there purely for display
logic, and since we don't display it, we don't need it. (Note that,
given your other *important* thesis project, it might even be that
999 C5 isn't required at all and that even $r is only a "nice to have",
not a deal breaker.)
Even if we do store some bibliographic info in 999 C5, it would
definitely make a lot of sense to write $m as "$m Mueller, H et al."
instead of listing all authors. One might additionally consider dropping
the $t field entirely.
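To illustrate with the Arnold entry quoted further down: with $t
dropped, the field would shrink to something like

999C5 $$mArnold D.;$$n4$$p505-525$$rSGR:24744438016$$v5$$y2005

and entries with long author lists would additionally collapse their $m
to "First, A. et al.".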
If I don't misunderstand Tibor's comment, this could save a ton of words,
shrink this index to near nothing, and solve the problem entirely.
> To investigate further, let's first see whether the troubles are due
> to the indexing part that breaks words into pairs. If your tables are
> clean, can you run:
> $ bibindex -w global -a -i 767550
> and see if the error is reproducible? Can you send the MARC
> representation of this record so that we can see what characters it
> contains?
If it is http://zb0035.zb.kfa-juelich.de/record/767550/ (Sebastian?) and
my assumption about 999 C5 above is correct, this record is quite
screwed up. No funny characters, the input is UTF-8 encoded English, but
I see a lot of repeated SGR codes here, where each of them should appear
exactly once. It even has repeated $r subfields, which shouldn't happen
at all, and those repeated subfields even point to the very same SGRs:
999C5 $$p19-20$$rSGR:36849106381$$rSGR:36849106381$$rSGR:36849106381
999C5 $$rSGR:36849106381
999C5 $$mCrawfurd J.;$$p219-$$rSGR:36849106381$$rSGR:36849106381$$v1$$y1834
999C5 $$mArnold D.;$$n4$$p505-525$$rSGR:24744438016$$tJournal of Agrarian Change$$v5$$y2005
999C5 $$mBiswas K.;$$rSGR:36849106381$$rSGR:36849106381$$y1950
and so on. So I can imagine that any software might get quite confused
by this record. It confuses me even when I just look at it as a
librarian ;) Looks like some problem in the initial conversion routine.
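For reference, the raw MARCXML of the record can be pulled for
inspection; assuming the standard Invenio export URL scheme:

$ wget -O 767550.xml "http://zb0035.zb.kfa-juelich.de/record/767550/export/xm"

That should show whether the duplicated $r subfields already come in
from the ingest conversion or are introduced later in the workflow.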
@Sebastian: if and only if(!) you have a lot of spare time... For the
use case at hand it might be much more interesting than reindexing to
try loading the data with only a 999 C5 $r subfield (i.e. dropping all
other subfields). It would also be interesting to see how large the
indices get then and whether they could live on the smaller integer type
again. If this still doesn't work out, it might be worth trying to load
without any of the 999 C5 stanzas at all and seeing whether that works.
I have a feeling, though, that shortening $m to the first author (or
dropping it) and dropping $t may already be sufficient. That is then
almost Inspire-style indexing, and they have a proof of concept with
~10^6 records in a production system. It would surprise me if we
couldn't reproduce that.
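Measuring the effect of such a reload would be straightforward; a
sketch, once more assuming the stock schema and database name:

$ mysql invenio -e "SELECT table_name, ROUND((data_length+index_length)/1024/1024) AS mb FROM information_schema.TABLES WHERE table_schema='invenio' AND table_name LIKE 'idx%' ORDER BY mb DESC"

Together with MAX(id) from the forward tables this tells us whether we
are safely back under the 16M-term ceiling of UNSIGNED MEDIUMINT.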
--
Kind regards,
Alexander Wagner
Subject Specialist
Central Library
52425 Juelich
mail : [email protected]
phone: +49 2461 61-1586
Fax : +49 2461 61-6103
www.fz-juelich.de/zb/DE/zb-fi