https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662
--- Comment #290 from David Cook <[email protected]> ---

I've looked more at Bug 22417, and it's got me thinking. The OAI-PMH harvester has a few core needs:

1. Instant communication between the Web UI and the OAI-PMH harvester to coordinate harvesting tasks
2. The ability to schedule tasks
3. The ability to execute and repeat download tasks in parallel in the background
4. The ability to save downloaded records
5. The ability to import records into Koha

The first is achieved by exchanging JSON messages over a Unix socket. (I'd actually like to change this to a TCP socket and use JSON messages over an HTTP API. Using HTTP would make it easier to communicate over a network, would simplify the client code by using standard communication mechanisms, and has authentication methods which could help to secure the OAI-PMH harvester. I could also provide a Docker image that runs the OAI-PMH harvester in a separate Docker container from the Koha application server.) This works fairly well and is easy enough to achieve.

The second is provided by a bespoke scheduler built into the OAI-PMH harvester using POE timers. Given the granularity of the scheduling, I don't see this changing any time soon. In theory, a generic Koha task scheduler could replace this functionality, but that seems unlikely any time soon.

This is arguably one of the most complex parts of the OAI-PMH harvester. Currently, the OAI-PMH harvester uses an in-memory queue for download tasks and a database queue for import tasks. I've thought a lot about how RabbitMQ might be used to replace these queues. It could be useful to replace the in-memory download queue, and the download workers could be split out of the existing OAI-PMH harvester.

As for the import tasks, the download workers need to save the records and enqueue an import task ASAP. At the moment, they save the records to disk and add an import task to the database with a pointer to the records on disk.
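As a rough illustration of that first need, here's a minimal sketch in Python (not the harvester's actual Perl/POE code) of exchanging one JSON task message over a Unix-domain socket; the message fields are hypothetical, not the harvester's real protocol:

```python
import json
import socket

# A hypothetical task message the Web UI might send to the harvester.
# Field names here are illustrative only.
task = {"action": "start", "repository": "https://example.org/oai", "set": "serials"}

# socketpair() stands in for a listening Unix socket plus a connected client.
ui_side, harvester_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The UI serialises the task as one newline-terminated JSON message.
ui_side.sendall(json.dumps(task).encode("utf-8") + b"\n")

# The harvester reads until the newline delimiter and decodes the message.
buf = b""
while not buf.endswith(b"\n"):
    buf += harvester_side.recv(4096)
received = json.loads(buf)

ui_side.close()
harvester_side.close()
```

Moving this same JSON exchange onto HTTP over TCP would mostly mean swapping the socket framing for standard request/response handling, which is where the simplification comes from.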
It works well enough, but it assumes that you have the disk space and that the import worker has access to the local disk. I've been thinking it would be better to either 1) save the records to the database and enqueue a RabbitMQ message with a pointer to the database, or 2) send the records themselves in a RabbitMQ message.

I think the first option is probably better, because it gives increased visibility: you can't easily see all the messages sitting in a RabbitMQ queue, but you can inspect rows in the database. That said, saving to disk is going to be faster than sending the data over a network. (However, in the past I think I've sent individual records over the network; in this case, I'd be sending whole batches of records at once.)

But for a download worker to send records, it would need credentials for either the database or RabbitMQ... so I'm thinking that perhaps it would be better to use an import API with an API key, although that would involve receiving all the data over the network and then sending it on to the database. Also slow. But I haven't tested the actual speeds. The import API would save the data to the database and then enqueue an import task.

Of course, this would just be a re-working of what's already here. The benefits of re-working the queues and workers are arguable at this point, although there are certainly benefits to changing from a bespoke client/server communication protocol to HTTP over TCP.
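To make the pointer-vs-payload trade-off concrete, here is a small Python sketch comparing the two message shapes (the field names and batch id are my own illustrations, not an actual Koha schema):

```python
import json

# A hypothetical batch of downloaded OAI-PMH records (contents are illustrative).
records = [
    {"identifier": f"oai:example.org:{i}", "metadata": "<record/>" * 50}
    for i in range(100)
]

# Option 1: the records live in the database; the queue message is just a pointer.
pointer_message = json.dumps({"import_batch_id": 12345})

# Option 2: the whole batch travels inside the queue message itself.
payload_message = json.dumps({"records": records})

print(len(pointer_message), len(payload_message))
```

The pointer message stays a few dozen bytes no matter how large the batch grows, which is part of why option 1 keeps the broker lightweight while leaving the record data visible and queryable in the database.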
