Re: [CODE4LIB] De-dup MARC Ebook records
Mike,

Our RecordManager (https://github.com/KDK-Alli/RecordManager), used in conjunction with VuFind, does deduplication with our own algorithm, which might be of some interest to you. RecordManager works standalone, so no VuFind installation is needed. Some parts of it are still in active development, but the deduplication has been working pretty well so far. A short description of the algorithm is available at https://github.com/KDK-Alli/RecordManager/wiki/Deduplication, and the actual PHP code is in https://github.com/KDK-Alli/RecordManager/blob/master/classes/RecordManager.php, starting at the dedupRecord function.

--Ere

On 22.8.2013 18.07, Michael Beccaria wrote:
> Steve,
> I don't think it's so much finding a control field (the closest match I can use is ISBN or eISBN, which has its issues) but also normalizing the data in the fields so that matches are produced. It will no doubt take some time to figure out.
>
> Mike Beccaria
> Systems Librarian
> Head of Digital Initiative
> Paul Smith's College
> 518.327.6376
> mbecca...@paulsmiths.edu
> Become a friend of Paul Smith's Library on Facebook today!
>
> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of McDonald, Stephen
> Sent: Friday, August 16, 2013 8:16 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] De-dup MARC Ebook records
>
> Michael Beccaria said:
>> Thanks for the replies. To clarify, I am working with 2 (or more in the future) MARC records outside of the ILS. I've tried using MarcEdit, but my usage did vary... not much overlap with the control fields that were available to me. I have a feeling they are a bit varied. I'm also messing around with marcXimiL a little, but I'm having trouble getting it to output any records at all. I was also looking at the XC aggregation module, but I was having trouble getting that to work properly as well, and the listserv was unresponsive. It seemed like good software, but it required me to set up an OAI harvest source to allow it to ingest the records, and that... well... enough is enough... I think I will probably need to write something; at least that way I know what it will be doing, rather than plowing through software that has little to no support. Please feel free to let me know of a particular strategy you think might work best in this regard...
>
> If you couldn't get adequate deduping from the control fields available in MarcEdit deduping, what control fields do you think you need to dedup on? You can actually specify any arbitrary field and subfield for deduping in MarcEdit.
>
> Steve McDonald
> steve.mcdon...@tufts.edu

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland
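To illustrate the kind of field normalization Mike describes, here is a minimal sketch in Perl. The helper names and the match-key rule are assumptions for illustration, not RecordManager's actual algorithm: normalize ISBNs to ISBN-13 and titles to lowercase alphanumerics, then compare the resulting keys.

    use strict;
    use warnings;

    sub normalize_isbn {
        my ($isbn) = @_;
        $isbn =~ s/[^0-9Xx]//g;                  # drop hyphens and spaces
        return uc $isbn if length($isbn) == 13;
        return undef unless length($isbn) == 10;
        my $base = '978' . substr($isbn, 0, 9);  # ISBN-10 -> ISBN-13 prefix
        my @d    = split //, $base;
        my $sum  = 0;
        $sum += $d[$_] * ($_ % 2 ? 3 : 1) for 0 .. 11;
        return $base . ((10 - $sum % 10) % 10); # recomputed check digit
    }

    sub normalize_title {
        my ($title) = @_;
        $title = lc $title;
        $title =~ s/[^a-z0-9 ]//g;   # strip punctuation (ASCII only, for brevity)
        $title =~ s/\s+/ /g;         # collapse whitespace
        $title =~ s/^ | $//g;        # trim
        return $title;
    }

    # Records are candidate duplicates when their match keys are equal:
    my $key = (normalize_isbn('0-306-40615-2') // '') . '|'
            . normalize_title('The  Example:  Title!');
    print "$key\n";   # prints: 9780306406157|the example title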
Re: [CODE4LIB] Simple Web-based Dublin Core search engine?
Oops, should have read the other replies first. :)

--Ere

On 23.3.2011 17:05, Ere Maijala wrote:
> Hi Edward,
>
> I haven't actually used it for searching apart from a quick test, but PKP Harvester2 (pkp.sfu.ca/harvester/) might fit your needs. It's LAMP, open source and workable. We're building an OAI-PMH aggregator on top of it.
>
> --Ere
>
> On 16.3.2011 17:00, Edward M. Corrado wrote:
>> Hi,
>>
>> I [will soon] have a small set (< 1000 records) of Dublin Core metadata published in OAI_DC format that I want to be searchable via a Web browser. Normally we would use Ex Libris's Primo for this, but this particular set of data may have some confidential information, and our repository only has minimal built-in search functions. While we still may go with Primo for these records, I am looking at other possibilities.
>>
>> The requirements as I see them are:
>> 1) Can ingest records in OAI_DC format
>> 2) Allows remote end-users who are familiar with the collection to search these ingested records via a Web browser
>> 3) Search should be keyword anywhere or individual fields, although it does not need to have every whizzbang feature out there; in other words, basic search features are fine
>> 4) Should support the ability to link to the display copy in our repository (probably goes without saying)
>> 5) Should be simple to install and maintain (thus, at least in my mind, eliminating something like Blacklight)
>> 6) Preferably a LAMP application, although a Windows server based solution is a possibility as well
>> 7) Preferably open source, or at least no- or low-cost
>>
>> I haven't been able to find anything searching the Web, but it seems like something people may have done before. Before I re-invent the wheel or shoe-horn something together, does anyone have any suggestions?
>>
>> Edward

--
Ere Maijala
Kansalliskirjasto
Re: [CODE4LIB] Simple Web-based Dublin Core search engine?
Hi Edward,

I haven't actually used it for searching apart from a quick test, but PKP Harvester2 (pkp.sfu.ca/harvester/) might fit your needs. It's LAMP, open source and workable. We're building an OAI-PMH aggregator on top of it.

--Ere

On 16.3.2011 17:00, Edward M. Corrado wrote:
> Hi,
>
> I [will soon] have a small set (< 1000 records) of Dublin Core metadata published in OAI_DC format that I want to be searchable via a Web browser. Normally we would use Ex Libris's Primo for this, but this particular set of data may have some confidential information, and our repository only has minimal built-in search functions. While we still may go with Primo for these records, I am looking at other possibilities.
>
> The requirements as I see them are:
> 1) Can ingest records in OAI_DC format
> 2) Allows remote end-users who are familiar with the collection to search these ingested records via a Web browser
> 3) Search should be keyword anywhere or individual fields, although it does not need to have every whizzbang feature out there; in other words, basic search features are fine
> 4) Should support the ability to link to the display copy in our repository (probably goes without saying)
> 5) Should be simple to install and maintain (thus, at least in my mind, eliminating something like Blacklight)
> 6) Preferably a LAMP application, although a Windows server based solution is a possibility as well
> 7) Preferably open source, or at least no- or low-cost
>
> I haven't been able to find anything searching the Web, but it seems like something people may have done before. Before I re-invent the wheel or shoe-horn something together, does anyone have any suggestions?
>
> Edward

--
Ere Maijala (Mr.)
The National Library of Finland
Re: [CODE4LIB] unwanted (bogus) characters in marc
On 7.10.2010 15:17, Thomas Krichel wrote:
> Ere Maijala writes
>
>> # Fix non-UTF-8 characters with two highest bits set (we assume they are actually ISO-8859-1)
>
> What about
>
>   use Encode::Guess qw/latin-1/;
>   $decoded = decode("Guess", $dodgy_input);
>
> $decoded should then be a UTF-8 string with the utf8 flag on.

Would that work for predominantly proper UTF-8 input with some mistakes thrown in?

--Ere
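For comparison, a minimal self-contained sketch of the per-string behavior in question ($dodgy_input and the fallback policy are illustrative):

    use strict;
    use warnings;
    use Encode qw(decode);

    my $dodgy_input = "caf\xC3\xA9 and caf\xE9";  # valid UTF-8 mixed with a raw Latin-1 byte

    # Guessing (or any whole-string decode) is per string: mostly-valid
    # UTF-8 with a few stray Latin-1 bytes can only be rejected or
    # re-decoded as a whole. Trying a strict UTF-8 decode first makes
    # that explicit:
    my $decoded = eval {
        decode('UTF-8', $dodgy_input, Encode::FB_CROAK | Encode::LEAVE_SRC)
    };
    # The whole-string fallback mangles the valid UTF-8 parts, which is
    # why a byte-level fix-up like the regex in the other message may
    # still be needed for mixed input.
    $decoded = decode('ISO-8859-1', $dodgy_input) unless defined $decoded;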
Re: [CODE4LIB] unwanted (bogus) characters in marc
In Perl, something like this might do the trick:

    # Fix non-UTF-8 characters with two highest bits set (we assume they are actually ISO-8859-1)
    # Rule: there can't be a single byte with the high bits set followed by a byte in range 00-7F or C0-FF
    $str =~ s/([\xC0-\xFF])(?=[\x00-\x7F\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;

No wrapping there to keep it single-line. :)

--Ere

On 7.10.2010 14:56, Cowles, Esme wrote:
> Eric-
>
> I don't know the original source of those MARC files, but I've worked with files from an III system where diacritics had to be entered as character code escapes like Muse{226}e du Louvre (where 226 is the ANSEL code for a combining acute accent). So if somebody made a typo and entered something like Muse{22}6e du Louvre instead, you'd get some bogus invalid character.
>
> I was working with MARCXML files in Java, so I wrote a FilterReader class that removed any characters that were invalid in UTF-8 XML. I assume you could do something similar in Perl (probably with a fancy one-line regex).
>
> -Esme
> --
> Esme Cowles <escow...@ucsd.edu>
>
> "We've all heard that a million monkeys banging on a million typewriters will eventually reproduce the works of Shakespeare. Now, thanks to the Internet, we know this is not true." -- Robert Wilensky
>
> On Oct 7, 2010, at 6:51 AM, Eric Lease Morgan wrote:
>
>> How do I trap for unwanted (bogus) characters in MARC records?
>>
>> I have a set of Internet Archive identifiers, and have written the following Perl loop to get the MARC records associated with each one:
>>
>>   # process each identifier
>>   my $ua = LWP::UserAgent->new( agent => AGENT );
>>   while ( <DATA> ) {
>>
>>     # get the identifier
>>     chop;
>>     my $identifier = $_;
>>     print $identifier, "\n";
>>
>>     # get its corresponding MARC record
>>     my $response = $ua->get( ROOT . "$identifier/$identifier" . '_meta.mrc' );
>>     if ( ! $response->is_success ) { warn $response->status_line; next; }
>>
>>     # save it
>>     open MARC, ">$identifier.mrc" or die "Can't open $identifier.mrc: $!\n";
>>     binmode MARC, ':utf8';
>>     print MARC $response->content;
>>     close MARC;
>>
>>   }
>>
>> I then use the venerable marcdump to see the fruits of my labors: marcdump *.mrc. Unfortunately, marcdump returns the following error against (at least) one of my files:
>>
>>   bienfaitsducatho00pina.mrc
>>   utf8 "\xC3" does not map to Unicode at /System/Library/Perl/5.10.0/darwin-thread-multi-2level/Encode.pm line 162.
>>
>> What is going on here? Am I saving my files incorrectly? Is the original MARC data inherently incorrect? Is there some way I can fix the MARC record in question?
>>
>> --
>> Eric Lease Morgan

--
Ere Maijala
Kansalliskirjasto
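A minimal self-contained check of that substitution, assuming the input is a byte string with a stray Latin-1 character mixed into otherwise plain text:

    use strict;
    use warnings;

    # 0xE9 ("é" in Latin-1) becomes the UTF-8 pair 0xC3 0xA9:
    # 0xC0 + (0xE9 >> 6) == 0xC3 and 0x80 + (0xE9 & 0x3F) == 0xA9.
    # Already-valid UTF-8 (a lead byte followed by a continuation byte
    # in 80-BF) doesn't match the lookahead and is left alone.
    my $str = "caf\xE9 bien";
    $str =~ s/([\xC0-\xFF])(?=[\x00-\x7F\xC0-\xFF])/chr(0xC0 + (ord($1) >> 6)) . chr(0x80 + (ord($1) & 0x3F))/seg;
    print unpack('H*', $str), "\n";   # prints: 636166c3a9206269656e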
Re: [CODE4LIB] SRU indexes for Aleph
Here you are, enjoy! Note that there is one peculiarity in this response: the Bath profile actually doesn't define an isbn index, but it exists in the default explain response of YAZ Proxy and is replicated here.

--Ere

Quoting Ziso, Ya'aqov <z...@rowan.edu>:
> Ere, thanks. Can you share the response you get for your explain query then?
> ./Ya'aqov
>
> On 6/10/10 11:04 AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:
>> No, we have a perfectly working setup, explain included.
>>
>> --Ere
>>
>> Quoting Ziso, Ya'aqov <z...@rowan.edu>:
>>> Ere,
>>> So far Corey (NYU, given their no-implementation-yet) reported he gets a 404 for an explain query. Are you configured differently and ALSO get a 404?
>>> Ya'aqov
>>>
>>> From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Ere Maijala [ere.maij...@helsinki.fi]
>>> Sent: Thursday, June 10, 2010 10:10 AM
>>> To: CODE4LIB@LISTSERV.ND.EDU
>>> Subject: Re: [CODE4LIB] SRU indexes for Aleph
>>>
>>> Quoting LeVan,Ralph <le...@oclc.org>:
>>>> Something's not right with this picture. The YAZ Proxy IS the SRU server, so it should be delivering up the Explain record. If it has a configuration file that defines the mapping from CQL indexes to Z39.50 indexes, then it has all the information it needs to populate the indexInfo part of the Explain record. So, why no Explain record?
>>>
>>> The explain record is a static XML fragment, part of the configuration XML. It is separate from the PQF mapping file. In the default Aleph config it's completely missing, although the YAZ Proxy distribution does have it.
>>>
>>> --Ere

--
Ere Maijala (Mr.)
Kansalliskirjasto / The National Library of Finland

<?xml version="1.0"?>
<zs:explainResponse xmlns:zs="http://www.loc.gov/zing/srw/">
  <zs:version>1.1</zs:version>
  <zs:record>
    <zs:recordSchema>http://explain.z3950.org/dtd/2.0/</zs:recordSchema>
    <zs:recordPacking>xml</zs:recordPacking>
    <zs:recordData>
      <explain xmlns="http://explain.z3950.org/dtd/2.0/">
        <serverInfo>
          <host>linda.linneanet.fi</host>
          <port>210</port>
          <database>fin01</database>
        </serverInfo>
        <databaseInfo>
          <title>LINDA</title>
          <description lang="en" primary="true">
            SRU/Z39.50 Gateway to Union Catalog LINDA
          </description>
        </databaseInfo>
        <indexInfo>
          <set identifier="info:srw/cql-context-set/1/cql-v1.1" name="cql"/>
          <set identifier="info:srw/cql-context-set/1/dc-v1.1" name="dc"/>
          <set identifier="http://zing.z3950.org/cql/bath/2.0/" name="bath"/>
          <index id="12">
            <title>id</title>
            <map><name set="rec">id</name></map>
          </index>
          <index id="4">
            <title>title</title>
            <map><name set="dc">title</name></map>
          </index>
          <index id="21">
            <title>subject</title>
            <map><name set="dc">subject</name></map>
          </index>
          <index id="30">
            <title>date</title>
            <map><name set="dc">date</name></map>
          </index>
          <index id="62">
            <title>description</title>
            <map><name set="dc">description</name></map>
          </index>
          <index id="1003">
            <title>creator</title>
            <map><name set="dc">creator</name></map>
            <map><name set="dc">author</name></map>
          </index>
          <index id="1007">
            <title>identifier</title>
            <map><name set="dc">identifier</name></map>
          </index>
          <index id="1018">
            <title>publisher</title>
            <map><name set="dc">publisher</name></map>
          </index>
          <index id="1020">
            <title>editor</title>
            <map><name set="dc">editor</name></map>
          </index>
          <index id="7">
            <title>isbn</title>
            <map><name set="bath">isbn</name></map>
          </index>
          <index id="8">
            <title>issn</title>
            <map><name set="bath">issn</name></map>
          </index>
          <index id="1002">
            <title>name</title>
            <map><name set="bath">name</name></map>
          </index>
        </indexInfo>
        <schemaInfo>
          <schema identifier="info:srw/schema/1/marcxml-v1.1" sort="false" name="marcxml">
            <title>MARCXML</title>
          </schema>
          <schema identifier="info:srw/schema/1/dc-v1.1" sort="false" name="dc">
            <title>Dublin Core</title>
          </schema>
          <schema identifier="http://www.loc.gov/mods" sort="false" name="mods2">
            <title>MODS v2</title>
          </schema>
          <schema identifier="info:srw/schema/1/mods-v3.0" sort="false" name="mods">
            <title>MODS v3</title>
          </schema>
        </schemaInfo>
        <configInfo>
          <default type="numberOfRecords">0</default>
        </configInfo>
      </explain>
    </zs:recordData>
  </zs:record>
</zs:explainResponse>
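Given that explain record, the gateway can be exercised with plain SRU 1.1 GET requests; the ISBN in the query below is made up:

    # Fetch the explain record (an SRU request with operation=explain):
    http://linda.linneanet.fi:210/fin01?version=1.1&operation=explain

    # searchRetrieve using one of the advertised indexes:
    http://linda.linneanet.fi:210/fin01?version=1.1&operation=searchRetrieve&query=bath.isbn%3D9789510123454&maximumRecords=1&recordSchema=marcxml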
Re: [CODE4LIB] Zotero, unapi, and formats?
This was the only information I found when I developed unAPI support for our MetaLib installation: http://forums.zotero.org/discussion/1229/unapi-support/. Based on my experimentation and looking at the code, if my memory serves:

1. At least the formats mentioned in the forum post. I believe it uses the docs attribute to distinguish formats, as type can be e.g. application/xml for multiple formats.

2. Weird, but I don't remember how. I ended up providing only MARCXML, DC and RIS, because it chose MODS over MARCXML if both were available and did something that sucked. Things may have changed; this was in 2008.

3. Didn't test this one; we only provide a single record at a time.

4. It chose COinS over unAPI, at least at the time, and I found that to be a bit problematic.

5. Dunno.

--Ere

On 6.4.2010 16:48, Jonathan Rochkind wrote:
> Anyone know if there's any developer documentation for Zotero on its use of unAPI? Alternately, does anyone know where I can find the answers to these questions, or know the answers themselves?
>
> 1. What formats will Zotero use via unAPI? What MIME content-types does it use to recognize those formats (sometimes a format has several in use, or no official content-type)?
>
> 2. What is Zotero's order of preference when multiple formats are available via unAPI?
>
> 3. Will Zotero get confused if different documents on the page have different formats available? This can be described with unAPI, but it seems atypical, so I'm not sure if it will confuse Zotero.
>
> 4. If both unAPI and COinS are on a given page, will Zotero use both (resulting in possible double-import for citations exposed both ways)? Or only one? Or does it depend on how you set up the HTML?
>
> 5. Somewhere that now I can't find I saw a mention of a Zotero RDF format that Zotero would consume via unAPI. Is there any documentation of this format/vocabulary? How can I find out how to write it?
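For reference, the general shape of an unAPI formats response for a single identifier (the identifier and docs URLs below are illustrative); the docs attribute is the per-format documentation link mentioned in point 1, and note that two formats can share the same type:

    <?xml version="1.0" encoding="UTF-8"?>
    <formats id="record123">
      <format name="marcxml" type="application/xml"
              docs="http://www.loc.gov/standards/marcxml/"/>
      <format name="oai_dc" type="application/xml"
              docs="http://www.openarchives.org/OAI/2.0/oai_dc/"/>
      <format name="ris" type="application/x-research-info-systems"/>
    </formats>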
Re: [CODE4LIB] List of MARC flavors
Also worth taking a look at is the Z39.50 OID list: http://www.loc.gov/z3950/agency/defns/oids.html

--Ere

On 23.3.2010 20:50, Houghton,Andrew wrote:
> Does anyone know where there might be a list of the various flavors of MARC? I currently have:
>
> marc21
> usmarc   US MARC (replaced by marc21)
> rusmarc  Russian MARC
> canmarc  Canadian MARC (replaced by marc21)
> ukmarc   UK MARC (replaced by marc21)
> cmarc    Chinese MARC
> unimarc  Uni-MARC

--
Ere Maijala (Mr.)
IT Research Specialist
The National Library of Finland
Re: [CODE4LIB] Q: XML2JSON converter [MARC-JSON]
On 03/15/2010 06:22 PM, Houghton,Andrew wrote:
> Secondly, Bill's specification loses semantics from ISO 2709, as I previously pointed out. His specification clumps control and data fields into one property named fields. According to ISO 2709, control and data fields have different semantics. You could have a control field tagged as 001 and a data field tagged as 001 which have different semantics. MARC-21 has imposed certain rules for

I won't comment on Bill's proposal, but I'll just say that I don't think you can have a control field and a data field with the same tag in a single MARC format. Well, technically it's possible, but in practice everything I've seen relies on the rules of the MARC format at hand. You could actually say that ISO 2709 works more like Bill's JSON and MARCXML is the different one, since in ISO 2709 the directory doesn't separate control and data fields.

--Ere

--
Ere Maijala (Mr.)
The National Library of Finland
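For illustration, a minimal record in the style of Bill's proposal (the record content is made up): control and data fields share one fields array, and it is the shape of the value (a bare string versus an object with indicators and subfields), not the tag, that tells them apart.

    {
      "leader": "00714cam a2200205 a 4500",
      "fields": [
        { "001": "12345" },
        { "245": {
            "ind1": "1",
            "ind2": "0",
            "subfields": [ { "a": "Example title" } ]
          }
        }
      ]
    }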
Re: [CODE4LIB] HTML mark-up in MARC records
Jonathan Rochkind wrote:
> Ere Maijala wrote:
>> That shouldn't be a problem, as any sane OAI-PMH provider, unAPI or ATOM serializer would escape the contents. Things that resemble HTML tags could be present in MARC records without any HTML-in-MARC too.
>
> Sure, and then, if you have HTML tags in your MARC, the system doing the re-use is going to present content to users with escaped HTML in it, which isn't desirable either!

How the content is stored in the transport format is separate from how it is used. Whatever the re-using system does is not related to how the data was transferred to it. If it extracts the stuff from the XML, it will of course unescape the content, but what happens after that is up to the system and unrelated to the transport mechanism. So here is an example of the whole process:

MARC with embedded HTML
-> the OAI-PMH provider escapes the MARC in some XML format
-> the OAI-PMH harvester (the re-using system) unescapes the data from the XML format
-> something is done with the data

It's the same as if the source system stores the data internally in MARCXML. The content must be escaped so that it can be stored in MARCXML without messing up the markup, but when the system uses the data, e.g. for display, it's first retrieved from the XML and unescaped, and massaged into the desired display format only after that. If you use DOM to do the XML manipulation, all this happens automatically: you just write and read strings, and the DOM manipulation takes care of escaping and unescaping.

You could substitute XML with e.g. Base64 encoding if it makes thinking about this stuff easier. For instance, email clients often send binary files in Base64, but that doesn't mean the file is ruined, as the receiving email client can decode it back to the original binary.

--Ere
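A minimal sketch of that DOM round trip in Perl with XML::LibXML (the element name and field content are made up):

    use strict;
    use warnings;
    use XML::LibXML;

    # A subfield value containing something HTML-ish.
    my $value = 'See the <b>description</b> at http://example.com';

    # Writing: the DOM escapes the content in the serialization.
    my $doc      = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $subfield = $doc->createElement('subfield');
    $subfield->appendText($value);
    $doc->setDocumentElement($subfield);
    my $xml = $doc->toString();   # contains &lt;b&gt;description&lt;/b&gt;

    # Reading: parsing unescapes it back to the original string.
    my $parsed       = XML::LibXML->load_xml(string => $xml);
    my $roundtripped = $parsed->documentElement->textContent;
    print $roundtripped eq $value ? "round-trip ok\n" : "mismatch\n";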
Re: [CODE4LIB] COinS in OL?
Gabriel Farrell wrote:
> COinS would be great, but unAPI would be useful also. In the case of Zotero, for example, more information can be passed along with unAPI than with COinS.

I agree. They are not mutually exclusive and can be used for quite different purposes. In my experience COinS is great for getting the data for OpenURL linking, but unAPI is much better for saving records and gives the client the power to decide what information it needs.

--Ere

--
Ere Maijala (Mr.)
IT Research Specialist
The National Library of Finland
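For reference, a COinS span carrying a book ContextObject in its title attribute (the metadata values are made up); an OpenURL resolver or Zotero reads the key-value pairs from there:

    <span class="Z3988"
          title="ctx_ver=Z39.88-2004&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&amp;rft.btitle=Example%20Title&amp;rft.aulast=Smith&amp;rft.isbn=9780306406157"></span>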
Re: [CODE4LIB] refworks developer documentation?
Jonathan,

Official documentation: http://www.refworks.com/DirectExport.htm

Sample code from our MetaLib implementation is at http://wiki.helsinki.fi/display/Nelli/Direct+Export+to+RefWorks+from+MetaLib and from Voyager at http://www.nationallibrary.fi/libraries/linnea/pwebrecon2.html.

--Ere

Jonathan Rochkind wrote:
> Does anyone know where, if anywhere, I can find documentation on the ways to send references to RefWorks for importing? I'm not having any luck on their website. I know I've seen it before, though. I remember there were a variety of formats and methods you could use to send things to RefWorks for an import. There must be documentation somewhere? I bet some code4libber has done this before.
>
> Jonathan
>
> --
> Jonathan Rochkind
> Digital Services Software Engineer
> The Sheridan Libraries
> Johns Hopkins University
> 410.516.8886
> rochkind (at) jhu.edu

--
Ere Maijala (Mr.)
IT Research Specialist
The National Library of Finland
P.O. Box 26 (Teollisuuskatu 23)
FI-00014 University of Helsinki
FINLAND
ere.maijala(@)helsinki.fi
Tel. +358 9 191 44260