Hi Martin:

On Fri, 14 Mar 2008, Martin Vesely wrote:

[d-rank testing on CDSDEV]
> So I was thinking that the citation indexing could be switched off
> during the time of upload. Please let me know what do you think
> about it.

The "indexing" Tony spoke about is actually the low-level MySQL
indexing happening for the bib99x table.  It cannot be switched off,
because this table is an "authority" table for cites.

In the future we may indeed speed up bibupload by not populating the
bib99x table at all and storing cites only in MARCXML (1 SQL INSERT =
very fast).  But then the citation indexer/searcher would have to be
updated to look inside MARCXML and build custom citation indexes
instead of looking inside the bibxxx tables.  This is a "big change"
that we can undertake in the future.  (We have started to some extent
by populating idxPHRASE tables, but they are still switched off for
the next release.)

So, for immediate d-rank testing, I think it is best to continue with
the usual bibupload.  If it takes as much as 2 secs per rec, then I
can look at profiling the queries and perhaps tweak the table
definition to speed this up.  (Tweaking helped a lot for the Indico
Search site, by a factor of 20 or so, where one of the bibxxx tables
used to hold very similar values, so MySQL couldn't use its indexes
effectively.
But I guess no tweaking will help much in your case, because the cites
are usually "very different".)

Another speed-up possibility would be not to store "useless"
information such as 999 $m if you don't need it for d-rank, and store
only recognized references like $s and $r.  You can write a little
XSLT to retain only $s and $r from Tony's files.  Maybe this would be
enough for d-rank testing?  We basically store only $s in the Inspire
test site, and Marko reported a speed of something like 2 recs/sec
for all metadata updates (not only cites).
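
The filtering step above can be sketched as follows.  This is only an
illustration, written in Python rather than XSLT, and it rests on
assumptions not confirmed here: that Tony's files are MARCXML in the
standard MARC21 "slim" namespace, and that the cites sit in datafields
with tag 999 regardless of indicators.  The function name
strip_reference_subfields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the input uses the standard MARC21 "slim" namespace.
MARC_NS = "http://www.loc.gov/MARC21/slim"
KEEP_CODES = frozenset({"s", "r"})


def strip_reference_subfields(marcxml, keep=KEEP_CODES):
    """Return MARCXML where 999 datafields retain only the `keep` subfields.

    A 999 datafield left with no subfields is dropped entirely.
    All other fields are passed through untouched.
    """
    # Serialize with a default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", MARC_NS)
    root = ET.fromstring(marcxml)
    datafield = "{%s}datafield" % MARC_NS
    subfield = "{%s}subfield" % MARC_NS
    # iter() also yields the root, so a bare <record> document works too.
    for record in root.iter("{%s}record" % MARC_NS):
        for field in list(record.findall(datafield)):
            if field.get("tag") != "999":
                continue
            for sub in list(field.findall(subfield)):
                if sub.get("code") not in keep:
                    field.remove(sub)
            if field.find(subfield) is None:
                record.remove(field)
    return ET.tostring(root, encoding="unicode")
```

The result could then be fed to the usual bibupload; whether dropping
$m is acceptable depends on what d-rank actually reads.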

Best regards
-- 
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
