Tomas, do you mean that “rebuild_zebra.pl -daemon” is used by default from 3.22 onward, or that it’s the preferable way of doing it but not the actual default? I’m a bit confused :).
For the moment, I’m going to write the OAI-PMH importer to use Zebra for matching, and we’ll see during the testing phase if it’s fast enough. In theory, a stale Zebra should only matter if a new record is downloaded via OAI-PMH and not added to Zebra before an updated version of that record is downloaded and actioned. In theory, that scenario should be rare, although in practice I can imagine someone adding a record and then fixing a spelling mistake a second or two later, and having those two rapid-fire changes reach downstream close together.

I may also try to build some heuristics into the OAI-PMH importer so that it checks its own OAI-PMH tables before trying to check Zebra. That could help mitigate certain scenarios where the Zebra queue gets backed up or Zebra dies, but there’s already a trail of previous OAI-PMH updates that can be used for matching. It just wouldn’t take into account a scenario where a record already exists in the catalogue via a non-OAI-PMH import and Zebra is down/unavailable. In that case, there would be duplicate records. But without a way of checking the source of truth, I don’t see any other options…

David Cook
Systems Librarian
Prosentient Systems
72/330 Wattle St, Ultimo, NSW 2007

From: Tomas Cohen Arazi [mailto:tomasco...@gmail.com]
Sent: Wednesday, 9 December 2015 5:41 AM
To: David Cook <dc...@prosentient.com.au>
Cc: Jesse <pianohac...@gmail.com>; koha-devel <koha-devel@lists.koha-community.org>
Subject: Re: [Koha-devel] Proposed "metadata" table for Koha

2015-12-01 1:44 GMT-03:00 David Cook <dc...@prosentient.com.au>:

> My main concern about Zebra is with it not being fast enough. Tomas
> mentioned that Zebra runs updates every 5 seconds, but it looks to me like
> rebuild_zebra.pl (via /etc/cron.d/koha-common) is only run every 5 minutes
> on a Debian package install. At least that was the case in Koha 3.20.
> That’s a huge gap when it comes to importing records. Say you import
> record A at 5:01pm… and then you try to import record A again at 5:03
> using a matching rule. The update from 5:01pm hasn’t been processed yet,
> so you wind up with 2 copies of record A in your Koha catalogue.

You are right that the default setup is setting a cronjob to check the queue every 5 minutes. I planned to change that default behaviour to set USE_INDEXER_DAEMON=yes in /etc/default/koha-common and have the cron line commented or conditional to USE_INDEXER_DAEMON=no. But I got distracted the last couple weeks before the release and forgot to post a patch for that. Anyway, rebuild_zebra.pl -daemon should be run by default.

> We run openSUSE and we define our own Zebra indexer, which does run every
> 5 seconds. I haven’t stress tested it yet, but 5 seconds might be a bit
> long even under ideal circumstances, if an import job is running every 2-3
> seconds. Sure, that 2-3 seconds might be a bit optimistic… maybe it will
> also be every 5 seconds. But what happens if the Zebra queue backs up?
> Someone runs “touch_all_biblios.pl” or something like that and fills up
> the zebra queue, while you’re importing records. Zebra is going to be out
> of date.

True

> There needs to be a source of truth, and that’s the metadata record in
> MySQL. Zebra is an indexed cache, and while usually it doesn’t matter too
> much if that cache is a little bit stale, it can matter when you’re
> importing records.

I agree the importing step should rely on the source of truth.

> Another scenario… what happens if Zebra is down? You’re going to get
> duplicate records because the matcher won’t work. That said, we could
> mitigate that by double-checking that Zebra is actually alive
> programmatically before commencing an import. There could be other
> heuristics as well… like not running an import (which uses matching rules)
> unless the zebraqueue only has X number of waiting updates. Ideally, it
> would be 0, but that’s unlikely in large systems. It also is impossible if
> you’re importing at a rate of more than one every 5 seconds (which would
> be absurdly slow).

I wouldn't create a workaround for Zebra being down only for being able to match existing records... I would just print a red box saying the tool is not available.

> I am curious as to whether the default Zebra update time for Debian
> packages is 5 minutes or 5 seconds. While it doesn’t affect me too much as
> I don’t use Debian, it matters for regular users of Koha.

Ok, i'll provide the mentioned patch :-P

--
Tomás Cohen Arazi
Theke Solutions (http://theke.io)
✆ +54 9351 3513384
GPG: B76C 6E7C 2D80 551A C765 E225 0A27 2EA1 B2F3 C15F
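
For illustration only, the matching heuristics discussed in this thread (check the importer's own OAI-PMH tables first, and only trust Zebra when it is alive and its queue is short, otherwise refuse to import) might be sketched roughly as below. This is a minimal sketch in Python rather than Koha's Perl, and nothing in it is Koha's actual API: `decide_import`, `MatchResult`, the four lookup callables and `max_pending` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MatchResult:
    action: str                        # "update", "create", or "defer"
    biblionumber: Optional[int] = None
    reason: str = ""


def decide_import(
    oai_identifier: str,
    local_lookup: Callable[[str], Optional[int]],  # hypothetical: importer's own OAI-PMH table
    zebra_lookup: Callable[[str], Optional[int]],  # hypothetical: matching rule run against Zebra
    zebra_alive: Callable[[], bool],               # hypothetical: health check
    pending_queue: Callable[[], int],              # hypothetical: count of waiting zebraqueue updates
    max_pending: int = 0,                          # the "X" from the thread; 0 is the ideal case
) -> MatchResult:
    # 1. A previous OAI-PMH import of the same record can be matched
    #    without Zebra, so a stale or dead index does not matter here.
    biblionumber = local_lookup(oai_identifier)
    if biblionumber is not None:
        return MatchResult("update", biblionumber, "matched in OAI-PMH import table")

    # 2. Otherwise Zebra is the only way to find a record that entered the
    #    catalogue by another route, so refuse to guess if it cannot be trusted.
    if not zebra_alive():
        return MatchResult("defer", reason="Zebra unavailable")
    if pending_queue() > max_pending:
        return MatchResult("defer", reason="Zebra queue backed up")

    # 3. Zebra is alive and caught up: use it for matching.
    biblionumber = zebra_lookup(oai_identifier)
    if biblionumber is not None:
        return MatchResult("update", biblionumber, "matched in Zebra")
    return MatchResult("create", reason="no match found")


# Usage: the record was imported before, so Zebra being down does not matter.
seen = {"oai:example.org:1": 42}
result = decide_import("oai:example.org:1", seen.get,
                       lambda _id: None, lambda: False, lambda: 0)
print(result.action, result.biblionumber)  # update 42
```

The "defer" outcome corresponds to the suggestion of simply reporting that the tool is not available rather than working around a dead Zebra; whether a deferred record should be retried later or dropped is left open here.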