https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=10662
David Cook <[email protected]> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|Patch doesn't apply |In Discussion --- Comment #108 from David Cook <[email protected]> --- Thinking more about matching and how using a OAI-PMH identifier really isn't enough, especially as the identifier is unique only within the repository. You could have two separate repositories with the exact same identifier, so you need to check the OAI-PMH repository URL as well. https://www.openarchives.org/OAI/openarchivesprotocol.html#UniqueIdentifier There are a few ways of verifying that two records describe the same upstream record, but it involves some analysis. https://www.openarchives.org/OAI/2.0/guidelines-aggregator.htm#Identifiers And that analysis gets tricky when you're wanting to match against MARCXML records using Zebra. Especially since different frameworks may or may not contain the fields that you store OAI-PMH data in for the purposes of matching. -- I'm thinking of maybe making a sort of tiered search... where we search the database for OAI-PMH details... and if none are found then we use the Zebra search. However, that's problematic as it sets up some inconsistencies in methods of importing. -- To date, we think about importing only in terms of MARCXML... and that makes some sense. So with OAI-PMH, surely we can still just think of it in terms of MARCXML. Except that the harvested record isn't necessarily the same as the imported record. Although maybe it should be. Maybe instead of using OAI specific details, we should require the use of http://www.loc.gov/marc/bibliographic/bd035.html. I don't know how realistic that is though. How many organisations actually have registered MARC Organisation codes? Maybe that's the prerogative of the Koha user rather than the Koha system though. 
It looks like VuFind uses the MARC 001 for matching (https://github.com/vufind-org/vufind/blob/master/import/marc.properties), although that's obviously highly problematic. It has some facility for adding a prefix to the 001 for uniqueness, but that's a hack.

A sample harvest of DSpace's oai_dc into VuFind's Solr indexes uses the OAI-PMH identifier for matching, it seems (https://vufind.org/wiki/indexing:dspace), but as I noted above, that's also technically problematic, as you may have the same identifier in multiple repositories. In theory it shouldn't happen... but it could.

It looks like DSpace uses the OAI-PMH identifier (stored in the database, it seems) for matching as well: https://github.com/DSpace/DSpace/blob/master/dspace-api/src/main/java/org/dspace/harvest/OAIHarvester.java#L485. As noted above, this has issues if the identifier isn't unique outside the repository. DSpace has a sanity check to make sure the item hasn't been deleted before trying to do an update.

I've been thinking I could match an OAI-PMH identifier to a biblionumber using a foreign key with RESTRICT, so that you can't delete a bib record without unlinking it from its OAI-PMH provenance record.

So VuFind and DSpace don't have the most sophisticated of matchers, and they both have problems which I'd like to avoid.

--

I recall Leif suggesting that we export some data to MySQL tables (e.g. 001, 003, 020, 022, 035), but that's not without its difficulties, and it keeps us locked into MARC as well.

--

I also remember Mirko mentioning Catmandu::OAI, but it's just a layer over HTTP::OAI, and HTTP::OAI is flawed in a few ways and won't meet Stockholm University Library's requirement of an OAI-PMH harvester that parses an XML stream.

In any case... downloading OAI-PMH records is the easy part. It's what to do with them once we're uploading to Koha...

--

For a truly robust solution, I think we'd need to overhaul Koha's importing facilities, and I'm not sure of the best way to do that.
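The provenance-link idea can be sketched with SQLite from Python (the table and column names here are my invention, not Koha's actual schema): the foreign key from the provenance row to the bib record uses ON DELETE RESTRICT, so deleting a still-linked bib record fails until the provenance row is removed first.

```python
import sqlite3

# Hypothetical schema sketch: an OAI-PMH provenance table whose foreign
# key RESTRICTs deletion of a bib record that is still linked.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection
conn.executescript("""
    CREATE TABLE biblio (
        biblionumber INTEGER PRIMARY KEY
    );
    CREATE TABLE oai_harvest_provenance (
        repository_url TEXT NOT NULL,
        oai_identifier TEXT NOT NULL,
        biblionumber   INTEGER NOT NULL
            REFERENCES biblio(biblionumber) ON DELETE RESTRICT,
        PRIMARY KEY (repository_url, oai_identifier)
    );
""")
conn.execute("INSERT INTO biblio VALUES (123)")
conn.execute("INSERT INTO oai_harvest_provenance VALUES "
             "('https://repo-a.example.org/oai', 'oai:example:42', 123)")

# Trying to delete the linked bib record is blocked by the constraint.
blocked = False
try:
    conn.execute("DELETE FROM biblio WHERE biblionumber = 123")
except sqlite3.IntegrityError as e:
    blocked = True
    print("delete blocked:", e)

# Unlink the provenance row first, and the delete goes through.
conn.execute("DELETE FROM oai_harvest_provenance WHERE biblionumber = 123")
conn.execute("DELETE FROM biblio WHERE biblionumber = 123")
```

Note the composite primary key on (repository_url, oai_identifier), which also enforces the point above about identifiers only being unique per repository.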
We need to be able to link X number of arbitrary identifiers to a particular Koha "record", and we need to be able to query those identifiers in a way that allows for rapid matching.

To be honest, this is something that Linked Data seems to be good at. You have a particular subject, and then you can link data to it arbitrarily. Then for your query, you could look for "?subject <http://koha/hasOAIPMHid> <oai:koha:5000>" or "?subject <http://koha/marc/controlNumber> '666'".

Of course, ElasticSearch would work just as well, although you'd still want to save that data somewhere as a source of truth. We don't tend to use things like Zebra/Solr/ElasticSearch as the sole repository of data, since we can refresh them from a source of truth.

I suppose both a triplestore and an RDBMS have the same problem. You can link an identifier to a record, but what if you lose that link? You wind up with duplicates. I suppose the best you can do is try to prevent people from destroying links by accident.
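The "arbitrary identifiers per record" idea can be sketched as a toy triple-style index in Python (the predicates and values are the examples from above, not a real Koha vocabulary): each record carries any number of (predicate, value) pairs, and matching is a reverse lookup on any one of them.

```python
# Toy triple-store-style sketch: link any number of (predicate, value)
# identifiers to a record, then match on any one of them.
from collections import defaultdict


class IdentifierIndex:
    def __init__(self):
        self._by_pair = {}                   # (predicate, value) -> subject
        self._by_subject = defaultdict(set)  # subject -> {(predicate, value)}

    def link(self, subject, predicate, value):
        self._by_pair[(predicate, value)] = subject
        self._by_subject[subject].add((predicate, value))

    def match(self, predicate, value):
        """Reverse lookup: which record carries this identifier?"""
        return self._by_pair.get((predicate, value))


idx = IdentifierIndex()
# One bib record, several arbitrary identifiers attached to it:
idx.link("biblio:5000", "http://koha/hasOAIPMHid", "oai:koha:5000")
idx.link("biblio:5000", "http://koha/marc/controlNumber", "666")

print(idx.match("http://koha/hasOAIPMHid", "oai:koha:5000"))  # biblio:5000
print(idx.match("http://koha/marc/controlNumber", "666"))     # biblio:5000
print(idx.match("http://koha/marc/controlNumber", "999"))     # None
```

This is essentially what the SPARQL patterns above express: the subject is the record, and each identifier is just another predicate/value pair hung off it, so new identifier schemes can be added without schema changes.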
