Tomas, do you mean that “rebuild_zebra.pl -daemon” is used by default from 
3.22 onward, or that it’s the preferable way of doing it but not the actual 
default? I’m a bit confused :).

 

For the moment, I’m going to write the OAI-PMH importer to use Zebra for 
matching, and we’ll see during the testing phase if it’s fast enough. 

 

In theory, a stale Zebra should only matter if a new record is downloaded via 
OAI-PMH and not added to Zebra before an updated version of that record is 
downloaded and actioned. That scenario should be rare, although in practice I 
can imagine someone adding a record and then fixing a spelling mistake a 
second or two later, and having those two rapid-fire changes reach downstream 
close together. I may also try to build some heuristics into the OAI-PMH 
importer so that it checks its own OAI-PMH tables before trying to check 
Zebra. That could help in scenarios where the Zebra queue gets backed up or 
Zebra dies, but there’s already a trail of previous OAI-PMH updates that can 
be used for matching. It wouldn’t cover the scenario where a record already 
exists in the catalogue via a non-OAI-PMH import and Zebra is 
down/unavailable. In that case, there would be duplicate records. But without 
a way of checking the source of truth, I don’t see any other options…
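
As a rough illustration of the lookup order I have in mind, here is a minimal 
sketch in Python rather than Koha's Perl. Everything named in it is made up 
for the example: the oai_imports table and its columns, the ZebraUnavailable 
exception, and the zebra_lookup callable that stands in for whatever matcher 
the importer ends up using.

```python
import sqlite3
from typing import Callable, Optional, Tuple


class ZebraUnavailable(Exception):
    """Hypothetical error raised when the Zebra index cannot be queried."""


def find_existing_biblio(
    db: sqlite3.Connection,
    repository: str,
    identifier: str,
    zebra_lookup: Callable[[str], Optional[int]],
) -> Tuple[Optional[int], str]:
    """Return (biblionumber or None, where the answer came from)."""
    # 1. Check the importer's own trail of previous OAI-PMH updates first.
    #    This does not depend on Zebra being up to date, or up at all.
    row = db.execute(
        "SELECT biblionumber FROM oai_imports "
        "WHERE repository = ? AND identifier = ?",
        (repository, identifier),
    ).fetchone()
    if row is not None:
        return row[0], "oai_table"

    # 2. Only then fall back to Zebra, which also knows about records that
    #    entered the catalogue by other routes.
    try:
        return zebra_lookup(identifier), "zebra"
    except ZebraUnavailable:
        # No trail and no index: the caller has to decide whether to skip
        # the record or risk creating a duplicate.
        return None, "zebra_unavailable"
```

Returning the source alongside the result lets the caller tell "Zebra found 
nothing" apart from "Zebra could not be asked", which is the case where 
duplicates would otherwise slip in.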

 

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St, Ultimo, NSW 2007

 

From: Tomas Cohen Arazi [mailto:tomasco...@gmail.com] 
Sent: Wednesday, 9 December 2015 5:41 AM
To: David Cook <dc...@prosentient.com.au>
Cc: Jesse <pianohac...@gmail.com>; koha-devel 
<koha-devel@lists.koha-community.org>
Subject: Re: [Koha-devel] Proposed "metadata" table for Koha

 



2015-12-01 1:44 GMT-03:00 David Cook <dc...@prosentient.com.au>:
>
> My main concern about Zebra is with it not being fast enough. Tomas mentioned 
> that Zebra runs updates every 5 seconds, but it looks to me like 
> rebuild_zebra.pl (via /etc/cron.d/koha-common) is only run every 5 minutes on 
> a Debian package install. At least that was the case in Koha 3.20. That’s a 
> huge gap when it comes to importing records. Say you import record A at 
> 5:01pm… and then you try to import record A again at 5:03pm using a matching 
> rule. The update from 5:01pm hasn’t been processed yet, so you wind up with 2 
> copies of record A in your Koha catalogue.

You are right that the default setup sets a cron job to check the queue 
every 5 minutes. I planned to change that default behaviour to set 
USE_INDEXER_DAEMON=yes in /etc/default/koha-common and have the cron line 
commented out or made conditional on USE_INDEXER_DAEMON=no. But I got 
distracted in the last couple of weeks before the release and forgot to post 
a patch for that. Anyway, rebuild_zebra.pl -daemon should be run by default.

 

> We run openSUSE and we define our own Zebra indexer, which does run every 5 
> seconds. I haven’t stress tested it yet, but 5 seconds might be a bit long 
> even under ideal circumstances, if an import job is running every 2-3 
> seconds. Sure, that 2-3 seconds might be a bit optimistic… maybe it will also 
> be every 5 seconds. But what happens if the Zebra queue backs up? Someone 
> runs “touch_all_biblios.pl” or something like that and fills up the zebra 
> queue, while you’re importing records. Zebra is going to be out of date.

True

 

> There needs to be a source of truth, and that’s the metadata record in MySQL. 
> Zebra is an indexed cache, and while usually it doesn’t matter too much if 
> that cache is a little bit stale, it can matter when you’re importing records.

I agree the importing step should rely on the source of truth.

 

> Another scenario… what happens if Zebra is down? You’re going to get 
> duplicate records because the matcher won’t work. That said, we could 
> mitigate that by double-checking that Zebra is actually alive 
> programmatically before commencing an import. There could be other heuristics 
> as well… like not running an import (which uses matching rules) unless the 
> zebraqueue only has X number of waiting updates. Ideally, it would be 0, but 
> that’s unlikely in large systems. It also is impossible if you’re importing 
> at a rate of more than one every 5 seconds (which would be absurdly slow).

I wouldn't create a workaround for Zebra being down only for being able to 
match existing records... I would just print a red box saying the tool is not 
available.

 

> I am curious as to whether the default Zebra update time for Debian packages 
> is 5 minutes or 5 seconds. While it doesn’t affect me too much as I don’t use 
> Debian, it matters for regular users of Koha.

OK, I'll provide the mentioned patch :-P

--
Tomás Cohen Arazi
Theke Solutions (http://theke.io)
✆ +54 9351 3513384
GPG: B76C 6E7C 2D80 551A C765  E225 0A27 2EA1 B2F3 C15F

 

_______________________________________________
Koha-devel mailing list
Koha-devel@lists.koha-community.org
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
