http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=7419
--- Comment #30 from Jared Camins-Esakov <[email protected]> ---
Created attachment 18714
  --> http://bugs.koha-community.org/bugzilla3/attachment.cgi?id=18714&action=edit
Bug 7419: General-purpose record deduplicator

This patch adds a script for deduplicating records. It is most useful
for authority records, but by design it could easily be extended for use
with bibliographic records if someone had a good use case.

To test using the sample records attached to the bug (MARC21 only):
1) Apply patches.
2) Import the sample records file (sampleauths.mrc) into Koha and make
   sure that the records get indexed (by waiting until rebuild_zebra.pl
   runs automatically or by running rebuild_zebra.pl -a -z [-x]
   manually).
3) Deduplicate based on LCCN. Replace {FIRSTAUTH} with the authid of the
   first imported record; on my system that number is 367123668. The
   limit is not strictly necessary, but without it the process could
   take quite a while if you have a lot of authority records:
   > misc/migration_tools/dedup_records.pl -t -v -a \
     -l "authid >= {FIRSTAUTH}" -r -m "lc-card/010a" -s date
4) Check that you have 21 duplicate records replaced from amongst the
   new records (you will quite possibly have more than 21 duplicate
   records reported, depending on whether you have any of these
   authorities already, but you should have at least the 21).
5) Deduplicate based on genre heading, preferring Library of Congress
   authorities to local authorities:
   > misc/migration_tools/dedup_records.pl -t -v -a \
     -l "authid >= {FIRSTAUTH}" -r -m "he/155a" -s "source=DLC" -s date
6) Check that you have 3 duplicate records replaced from amongst the new
   records.
7) Sign off.

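To make the counts in steps 4 and 6 easier to reason about, here is a
rough conceptual sketch in Python of what "duplicate records replaced"
could mean. It is not how dedup_records.pl works internally (the script
matches through Koha matching rules and search indexes, whereas this
sketch compares raw strings). The record data, the function names, and
the assumption that each duplicate set keeps one record and replaces the
rest are all illustrative.

```python
from collections import defaultdict

# Made-up, simplified records; real authority records are full MARC.
RECORDS = [
    {"authid": 100, "010a": "n 00000001"},  # predates the import
    {"authid": 200, "010a": "n 00000001"},
    {"authid": 201, "010a": "n 00000001"},
    {"authid": 202, "010a": "n 00000002"},
    {"authid": 203, "010a": "n 00000003"},
    {"authid": 204, "010a": "n 00000003"},
    {"authid": 205, "010a": "n 00000003"},
]


def limit_records(records, first_auth):
    """Stand-in for the -l "authid >= {FIRSTAUTH}" limit."""
    return [rec for rec in records if rec["authid"] >= first_auth]


def find_duplicate_sets(records, key_field):
    """Group authids by the value of key_field; keep groups of two or more."""
    groups = defaultdict(list)
    for rec in records:
        key = rec.get(key_field)
        if key:  # a record without the key cannot match on it
            groups[key].append(rec["authid"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


def count_replaced(duplicate_sets):
    """Assumption: one record per set survives, the others are replaced."""
    return sum(len(ids) - 1 for ids in duplicate_sets.values())


dup_sets = find_duplicate_sets(limit_records(RECORDS, 200), "010a")
for key, ids in dup_sets.items():
    print(key, ids)
print("replaced:", count_replaced(dup_sets))
```

With the limit in place, the pre-existing record 100 is ignored, which
is why a run without the limit may report more duplicates than the
number listed in the test plan.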
Complete POD documentation:

SYNOPSIS
    dedup_records.pl --match=1 -a

    dedup_records.pl --match="LC-card-number/010a" --select="date" \
      --limit="authid > 367123592" -a

    dedup_records.pl --match="Match/100abcdefghijklmnopqrstuvwxyz" \
      --select="source=DLC" --select="date" \
      --limit="authtypecode='PERSO_NAME'" -a

DESCRIPTION
    This script will identify duplicate records, and either suggest that
    you merge them (in the case of bibliographic records) or
    automatically merge them for you (in the case of authority records).

OPTIONS
    --help
        Prints this help

    -v|--verbose
        Print verbose log information (warning: very verbose!).

    -t|--test
        Do not actually make any changes to the database, just report
        what changes would be made.

    -r|--report
        Print a report of what happened during the run.

    -l|--limit=s
        Only process those records that match the user-specified WHERE
        clause (the WHERE is implied and should not be included on the
        command line).

    -a|--authorities
        Check for duplicate authorities rather than duplicate
        bibliographic records.

    -s|--select=s
        Repeatable. Specify how to identify which record to prefer. See
        the section on SELECTORS below.

    -m|--match=s
        Specifies the matching rule to use. This can be the numeric ID
        of a matching rule that you have already configured (preferred),
        or you can specify a matching rule on the command line in the
        following format:

        <index1>/<tag1><subfield1>[##<index2>/<tag2><subfield2>[##...]]

        Examples:
            at/152b##he-main/2..a##he/2..bxyzt##ident/009@
            authtype/152b##he-main,ext/2..a##he,ext/2..bxyz
            sn,ne,st-numeric/001##authtype/152b##he-main,ext/2..a##he,ext/2..bxyz

    -c|--check=s
        Only relevant when you are using a matching rule specified on
        the command line. Specifies sanity checks to use to ensure that
        the records are really duplicates.
        The format is <tag1><subfields1>[,<tag2><subfields2>[,...]]

        Examples:
            200abxyz
                will check subfields a,b,x,y,z of 200 fields
            009@,152b
                will check 009 data and 152$b subfields

SELECTORS
    This script supports a number of selectors for choosing which record
    is "better."

    score
        Prefer the record which is the best match based on the specified
        matching rule. This will probably only be useful in cases where
        the matching rule will not match the source record, since the
        source record will automatically be given a score of 2 * the
        matching rule threshold if it wasn't picked up by the matcher.

    date
        Prefer the record which is newer based on the 005 field.

    source=ABC
        MARC21 only. Prefer records which come from ABC based on the 003
        field.

    usage
        Authorities only. Prefer the record used in the most
        bibliographic records.

    ppn
        UNIMARC only. Prefer records which have a PPN in the 009 field.
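
As an illustration of how repeated selectors might combine, here is a
minimal Python sketch. It assumes that selectors given earlier on the
command line take priority and later ones only break ties; the
documentation above does not spell this out, so treat the ordering as an
assumption. The record dictionaries and all function names are
hypothetical and are not taken from dedup_records.pl.

```python
def source_selector(code):
    """Build a 'source=ABC' style selector: 1 if the 003 field equals code."""
    def select(rec):
        return 1 if rec.get("003") == code else 0
    return select


def date_selector(rec):
    """'date' style selector. A MARC 005 value has the form
    yyyymmddhhmmss.f, so plain string comparison orders by time."""
    return rec.get("005", "")


def choose_preferred(records, selectors):
    """Pick the record whose tuple of selector values is greatest.

    Tuples compare element by element, so the first selector dominates
    and later selectors only matter when earlier ones tie (assumed
    semantics).
    """
    return max(records, key=lambda rec: tuple(sel(rec) for sel in selectors))


# Made-up candidate records from one duplicate set.
candidates = [
    {"authid": 1, "003": "LOCAL", "005": "20130501120000.0"},
    {"authid": 2, "003": "DLC", "005": "20110101090000.0"},
    {"authid": 3, "003": "DLC", "005": "20120615103000.0"},
]

# Roughly corresponds to: -s "source=DLC" -s date
best = choose_preferred(candidates, [source_selector("DLC"), date_selector])
print(best["authid"])
```

Under these assumptions the newest DLC record wins even though a local
record is newer overall; with the date selector alone, the local record
would be chosen instead.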

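The command-line formats documented for --match and --check can be
sketched as small parsers. This is only an illustration in Python based
on the format strings and examples in the POD; the function names and
the returned structure are assumptions, and the real script may validate
or interpret the pieces differently (for example, the comma-separated
part before the slash is kept here as one opaque string).

```python
def parse_match_rule(rule):
    """Parse '<index>/<tag><subfields>[##<index>/<tag><subfields>...]'."""
    points = []
    for part in rule.split("##"):
        index, _, field = part.partition("/")
        if not index or len(field) < 3:
            raise ValueError(f"malformed match point: {part!r}")
        points.append({
            "index": index,          # e.g. "he-main,ext", kept as written
            "tag": field[:3],        # three characters, may contain "." as in "2.."
            "subfields": field[3:],  # may be empty (e.g. "001") or "@"
        })
    return points


def parse_check_spec(spec):
    """Parse '<tag><subfields>[,<tag><subfields>...]' into (tag, subfields)."""
    checks = []
    for part in spec.split(","):
        if len(part) < 3:
            raise ValueError(f"malformed check: {part!r}")
        checks.append((part[:3], part[3:]))
    return checks


for point in parse_match_rule("authtype/152b##he-main,ext/2..a##he,ext/2..bxyz"):
    print(point)
print(parse_check_spec("009@,152b"))
```

Keeping the tag as a three-character string preserves patterns such as
"2.." from the examples without guessing how the script expands them.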