http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662
--- Comment #50 from David Cook <[email protected]> ---
(In reply to Katrin Fischer from comment #33)
> I was told recently that 2-3 seconds is quite standard for OAI-PMH harvests.

Katrin, who said this to you? Andreas was also interested in harvesting every 2-3 seconds, but that doesn't seem very feasible to me. Today, I tried downloading records, and I can download 21 records from a Swedish server in 4-5 seconds.

The OAI-PMH harvester uses synchronous code for downloading records, so if you have multiple OAI-PMH servers, it has to download first from Server A, then Server B, then Server C... and only then does it start processing records. If each server takes 5 seconds, that's 15 seconds before you even start processing the first record.

I think I might be able to find some asynchronous code for downloading records with Perl, but even then it might take 5 seconds or longer just to download records... that's longer than the ideal 2-3 seconds. Plus, the asynchronous code would require me to stop using the HTTP::OAI module and create my own asynchronous version of it... which would take some time and would probably be more error-prone given the speed at which I'm trying to develop.

I suppose 21 records might be a lot for a harvest running every 2-3 seconds... I just tried the query "verb=ListRecords&metadataPrefix=marcxml&from=2015-12-01T18:01:45Z&until=2015-12-01T18:01:47Z", and my browser downloaded 4 records in 1 second. I suppose it might only take another 1-2 seconds to process those 4 records and import them into Koha. That's just a guess, though, as I haven't written the necessary new processing/importing code yet.

I suppose if I'm sending HTTP requests asynchronously and it only takes 1 second to fetch a handful of records, it might be doable in 2-3 seconds... but the more records there are to fetch from the server, the longer the download is going to take, and that blows out the overall time. If 2-3 seconds is just an ideal, it might not matter if it takes 5-10 seconds.
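As an aside, the sequential-versus-concurrent timing argument above can be sketched roughly as follows. This is a Python stand-in purely for illustration (the real harvester is Perl); the server names and the fixed per-server delay are hypothetical placeholders for real OAI-PMH requests:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-server fetch time (seconds); stands in for a real
# OAI-PMH ListRecords request against each configured server.
FETCH_SECONDS = 0.1
SERVERS = ["Server A", "Server B", "Server C"]

def fetch_records(server):
    """Simulate downloading one batch of records from one server."""
    time.sleep(FETCH_SECONDS)
    return server

def harvest_sequential(servers):
    # Server A, then B, then C: total time is the SUM of the delays.
    start = time.monotonic()
    results = [fetch_records(s) for s in servers]
    return results, time.monotonic() - start

def harvest_concurrent(servers):
    # All servers at once: total time is roughly the SLOWEST delay.
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        results = list(pool.map(fetch_records, servers))
    return results, time.monotonic() - start

if __name__ == "__main__":
    _, t_seq = harvest_sequential(SERVERS)
    _, t_con = harvest_concurrent(SERVERS)
    print(f"sequential: {t_seq:.2f}s, concurrent: {t_con:.2f}s")
```

With three servers at 5 seconds each, the sequential version is the "15 seconds before the first record" scenario, while the concurrent version stays near 5 seconds total.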
I'm keen for feedback from potential users of the OAI-PMH harvester. How much does time/frequency matter to you?

This might be a case of premature optimisation. It might be better for me to focus on building a functional system, and then worry about improving the speed later. I did that recently on a different project, and it worked quite well: I focused the majority of my time on meeting the functional requirements, and then spent a few hours tweaking the code to achieve massive increases in performance. However, if the system needs to be re-designed to gain those performance increases, that seems wasteful.

--

Another thought I had was to build an "import_oai" API into Koha, and then perhaps write the actual OAI-PMH harvester in a language which works asynchronously by design, like Node.js. Not that I'm excellent with Node.js. I've written code in my spare time which fetches data from a database and asynchronously updates a third-party REST API, but it's certainly not elegant... and requiring Node.js adds a layer of complexity that the Koha community would not want to support in any way, shape, or form, I would think.

But we could create an import API and then rely on individual libraries to supply their own OAI-PMH harvesters... although for that to work successfully, we would need standards for conversations between harvesters and importers. I'm thinking the "import_oai" or "import?type=oai" API might be a good idea in any case, although I'm not sure how Apache would cope with being hammered by an OAI-PMH harvester sending it multiple XML records every few seconds. Perhaps it's worthwhile having one daemon for downloading records and another for importing records. Perhaps it's worth writing a forking server to handle incoming records in parallel.

--

Honestly, though, I would ask that people think further about the frequency of harvests. Is every 2-3 seconds really necessary? Do we really need it to perform that quickly?
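For what it's worth, the "one daemon for downloading, another for importing" idea above could be sketched with a shared queue like this. Again a Python stand-in, not actual Koha code; the record strings and function names are hypothetical:

```python
import queue
import threading

# A shared queue decouples the downloader from the importer, so a slow
# import never blocks the next HTTP request (and vice versa).
record_queue = queue.Queue()
SENTINEL = None  # tells the importer that downloading has finished

def downloader(records):
    """Stand-in for the harvesting daemon: push records as they arrive."""
    for rec in records:
        record_queue.put(rec)
    record_queue.put(SENTINEL)

def importer(imported):
    """Stand-in for the importing daemon: drain the queue into Koha."""
    while True:
        rec = record_queue.get()
        if rec is SENTINEL:
            break
        imported.append(rec)  # real code would parse MARCXML and import it

if __name__ == "__main__":
    fetched = ["record-1", "record-2", "record-3"]  # hypothetical records
    done = []
    dl = threading.Thread(target=downloader, args=(fetched,))
    imp = threading.Thread(target=importer, args=(done,))
    dl.start(); imp.start()
    dl.join(); imp.join()
    print(done)
```

The same shape works whether the two sides are threads, forked processes, or genuinely separate daemons talking over a socket or an HTTP import API.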
If so, I'm open to ideas about how to achieve it. I have lots of ideas, as outlined above, but I'm more than happy to hear suggestions, and even happier to be told not to worry about the speed.

Unless people think it's a concern, I'm going to continue development with slower synchronous code. I want to make this harvester as modular as possible, so that future upgrades don't require a rewrite of the whole system. Right now, I see the bottleneck being the downloading of records and the passing of those records to a processor/importer. The importer, at least for KB, is going to be difficult in terms of the logic involved, but I'm not necessarily that worried about its speed at this point. So I might try to prototype a synchronous downloader as fast as I can and spend more time on the importer and refactoring existing code.

--
You are receiving this mail because:
You are watching all bug changes.
_______________________________________________
Koha-bugs mailing list
[email protected]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-bugs
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
