Hi Martin:

On Fri, 14 Mar 2008, Martin Vesely wrote:

[d-rank testing on CDSDEV]
> So I was thinking that the citation indexing could be switched off
> during the time of upload. Please let me know what do you think
> about it.

The "indexing" Tony spoke about is actually the low-level MySQL
indexing happening for the bib99x table. It cannot be switched off,
because this table is an "authority" table for cites.

In the future we may indeed speed up bibupload by not populating the
bib99x table at all, and storing cites only in MARCXML (1 SQL INSERT =
very fast). But then the citation indexer/searcher would have to be
updated to look inside MARCXML and to create custom citation indexes,
rather than looking inside the bibxxx tables. This is a "big change"
that we can undertake in the future. (We have started to some extent
by populating idxPHRASE tables, but they are still switched off for
the next release.)

So, for immediate d-rank testing, I think it is best to continue with
the usual bibupload. If it takes as much as 2 secs per rec, then I can
look at profiling the queries and perhaps tweak the table definition
to speed this up. (Tweaking helped a lot for the Indico Search site,
by a factor of 20 or so, where one of the bibxxx tables used to hold
very similar values, so MySQL couldn't use its indexes effectively.
But I guess no tweaking will help much in your case, because the cites
are usually "very different".)

Another speed-up possibility would be not to store "useless"
information such as 999 $m if you don't need it for d-rank, and to
store only recognized references such as $s and $r. You can write a
little XSLT to retain only $s and $r from Tony's files. Maybe this
would be enough for d-rank testing? We basically store only $s in the
Inspire test site, and Marko reported a speed of something like
2 recs/sec for all metadata updates (not only cites).

Best regards
--
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>
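
To illustrate the $s/$r filtering idea above, here is a minimal sketch in Python (standing in for the XSLT mentioned above, not a replacement for it). It assumes the input is MARCXML in which references sit in datafield elements with tag="999" and subfield elements carrying a code attribute; the function name and the KEEP_CODES set are hypothetical, and the layout of Tony's files is not known here, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET

# Assumption: only these 999 subfields are needed for d-rank testing.
KEEP_CODES = {"s", "r"}


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def strip_reference_subfields(marcxml_text, keep=KEEP_CODES):
    """Return MARCXML text whose 999 datafields retain only `keep` subfields.

    999 datafields left without any subfield are dropped entirely.
    All other fields are passed through untouched.
    """
    root = ET.fromstring(marcxml_text)
    # Snapshot the element list first, since children are removed below.
    for parent in list(root.iter()):
        for field in list(parent):
            if local_name(field.tag) != "datafield":
                continue
            if field.get("tag") != "999":
                continue
            for sub in list(field):
                is_subfield = local_name(sub.tag) == "subfield"
                if is_subfield and sub.get("code") not in keep:
                    field.remove(sub)
            if not any(local_name(c.tag) == "subfield" for c in field):
                parent.remove(field)
    return ET.tostring(root, encoding="unicode")
```

The filtered output could then be fed to the usual bibupload, so that fewer rows are inserted into the bib99x table per record. Whether this is enough to reach an acceptable upload speed would still have to be measured.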
