Re: [CODE4LIB] MARC magic for file
Actually, you can have records that are MARC21 coming out of vendor databases (which sometimes embed control characters into the leader) and still be valid. Once you stop looking at just your ILS or OCLC, you probably wouldn't be surprised to know that records start looking very different.

--TR

Terry Reese, Associate Professor
Gray Family Chair for Innovative Library Services
121 Valley Library
Corvallis, OR 97331
tel: 541.737.6384

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Wednesday, April 06, 2011 9:44 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

Can't you have a legal MARC file that does NOT have 4500 in those leader positions? It's just not legal Marc21, right? Other marc formats may specify, or even allow flexibility in, the things these bytes specify:

* Length of the length-of-field portion
* Number of characters in the starting-character-position portion of a Directory entry
* Number of characters in the implementation-defined portion of a Directory entry

Or, um, byte 23, which I guess is left to the specific Marc implementation (ie, Marc21 is one such) to use for its own purposes.

I have no idea how that should inform the 'marc magic'. Is mime-type application/marc defined as specifically Marc21, or as any Marc?

Jonathan

On 4/6/2011 12:28 PM, Ford, Kevin wrote:

Well, this brings us right up against the issue of files that adhere to their specifications versus forgiving applications. Think of browsers and HTML. Suffice it to say, MARC applications are quite likely to be forgiving of leader positions 20-23. In my non-conforming MARC file and in Bill's, leader positions 20-21 (45) seemed constant, but things could fall apart for positions 22-23.

So... I present the following (in-line and attached, to preserve tabs) in an attempt to straddle the two sides of this issue: applications forgiving of non-conforming files. Should the two characters following 45 (at position 20) *not* be 00, then the identification will be noted as non-conforming. We could classify this as reasonable identification but hardly ironclad (indeed, simply checking to confirm that part of the first 24 positions matches the specification hardly constitutes a robust identification, but it's something). It will also give you a mimetype now. Would anyone like to test it more fully on their own files?

#
# MARC 21 Magic (Third cut)
#
# Set at position 0
0	byte	x
# leader position 20-21 must be 45
20	string	45
# leader starts with 5 digits, followed by codes specific to MARC format
0	regex/1	(^[0-9]{5})[acdnp][^bhlnqsu-z]	MARC Bibliographic
!:mime	application/marc
0	regex/1	(^[0-9]{5})[acdnosx][z]	MARC Authority
!:mime	application/marc
0	regex/1	(^[0-9]{5})[cdn][uvxy]	MARC Holdings
!:mime	application/marc
0	regex/1	(^[0-9]{5})[acdn][w]	MARC Classification
!:mime	application/marc
0	regex/1	(^[0-9]{5})[cdn][q]	MARC Community
!:mime	application/marc
# leader position 22-23, should be 00 but is it?
0	regex/1	(^.{21})([^0]{2})	(non-conforming)
!:mime	application/marc

If this works, I'll see about submitting this copy. Thanks to all your efforts already.
Warmly,
Kevin

--
Library of Congress
Network Development and MARC Standards Office

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero [s...@unc.edu]
Sent: Sunday, April 03, 2011 14:01
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

I am pretty sure that the marc4j standard reader ignores them; the tolerant reader definitely does. Otherwise JHU might have about two parseable records, based on the mangled leaders that J-Rock gets stuck with :-)

An analysis of the ~7M LC bib records from the scriblio.net data files (~ Dec 2006) indicated that the leader has less than 8 bits of information in it (Shannon-Weaver definition). This excludes the initial length value, which is redundant given the end-of-record marker. The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader. The final characters of the leader are 450.

Also, I object to the phrase "decent MARC tool". Any tool capable of dealing with MARC as it exists cannot afford the luxury of decency :-)

[ HA: A clear conscience? BW: Yes, Sir Humphrey. HA: When did you acquire this taste for luxuries? ]

Simon

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:

I'm sure any decent MARC tool can deal with them, since decent MARC tools are certainly going to be forgiving enough to deal with four characters that
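(To make concrete what those leader bytes control: below is a minimal Perl sketch -- mine, not from any tool discussed in this thread -- of an ISO 2709-style parser that trusts the entry map in leader bytes 20-21 instead of assuming 4500. Feed it a record with garbage in byte 20 and the directory mis-parses, which is exactly the failure described later in the thread.)

    use strict;
    use warnings;

    # Parse the first directory entry of a raw MARC record, trusting the
    # leader's entry map (bytes 20-21) rather than assuming "4500".
    sub first_directory_entry {
        my ($raw) = @_;
        my $flen_len = substr $raw, 20, 1;  # width of length-of-field; Marc21 says 4
        my $fpos_len = substr $raw, 21, 1;  # width of starting-char-pos; Marc21 says 5
        die "non-numeric entry map: '$flen_len$fpos_len'"
            unless "$flen_len$fpos_len" =~ /^[0-9]{2}$/;
        my $entry = substr $raw, 24, 3 + $flen_len + $fpos_len;
        my $tag   = substr $entry, 0, 3;                       # e.g. "245"
        my $flen  = substr $entry, 3, $flen_len;               # field length
        my $fpos  = substr $entry, 3 + $flen_len, $fpos_len;   # field offset
        return ($tag, $flen, $fpos);
    }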
Re: [CODE4LIB] MARC magic for file
I'm not sure what you mean, Terry. Maybe we have different understandings of "valid". If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a valid Marc21 file. It violates the Marc21 specification.

Now, they may still be _usable_, by software that ignores these bytes anyway or works around them. We definitely have a lot of software that does that. Which can end up causing problems that remind me of very analogous problems caused by the early days of web browsers that felt like being 'tolerant' of bad data. My html works in every web browser BUT this one, why not? Oh, because that's the only one that actually followed the standard, oops.

I actually ran into an example of that problem with this exact issue. MOST software just ignores marc leader bytes 20-23, and assumes the semantics of 4500 -- the only legal semantics for Marc21. But Marc4j actually _respected_ them; apparently the author thought that some marc in the wild might intentionally set different bytes here (no idea if that's true or not). So if leader bytes 20-23 were invalid (according to the spec), Marc4j would suddenly decide that the length-of-field portion was NOT 4, but actually BELIEVE whatever was in leader byte 20, causing the record to be parsed improperly. And I had records like that coming out of my ILS (not even a vendor database). That was an unfun couple of days of debugging to figure out what was going on.

On 4/6/2011 12:52 PM, Reese, Terry wrote:

Actually, you can have records that are MARC21 coming out of vendor databases (which sometimes embed control characters into the leader) and still be valid. ...
Re: [CODE4LIB] MARC magic for file
Just as a historical note, this non-standard use of LDR/22 is likely due to OCLC's use of the character as a hexadecimal flag from back in the days when marc records were mostly schlepped around on tapes. They referred to it as the "Transaction type code". When records were sent to OCLC for processing, various values of the flag indicated whether a catalog card was to be produced, whether the record was an update, whether the user location symbol was to be set, etc. I'm sure others have used it for their own nefarious purposes as well.

Tim Prettyman
University of Michigan/LIT

On Apr 6, 2011, at 12:28 PM, Ford, Kevin wrote:

Well, this brings us right up against the issue of files that adhere to their specifications versus forgiving applications. Think of browsers and HTML. Suffice it to say, MARC applications are quite likely to be forgiving of leader positions 20-23. ...

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens o...@ostephens.com wrote:

I'm sure any decent MARC tool can deal with them, since decent MARC tools are certainly going to be forgiving enough to deal with four characters that apparently don't even really matter.

You say that, but I'm pretty sure Marc4J throws errors on MARC records where these characters are incorrect.

Owen

On Fri, Apr 1, 2011 at 3:51 AM, William Denton w...@pobox.com wrote:

On 28 March 2011, Ford, Kevin wrote:

I couldn't get Simon's MARC 21 Magic file to work. Among other issues, I received "line too long" errors. But, since I've been curious about this for some time, I figured I'd take a whack at it myself. Try this:

This is very nice! Thanks. I tried it on a bunch of MARC files I have, and it recognized almost all of them. A few it didn't, so I had a closer look, and they're invalid. For example, the Internet Archive's Binghamton catalogue dump:

http://ia600307.us.archive.org/6/items/marc_binghamton_univ/

$ file -m marc.magic bgm*mrc
bgm_openlib_final_0-5.mrc: data
Re: [CODE4LIB] MARC magic for file
Actually -- I'd disagree, because that is a very narrow view of the specification. When validating MARC, I'd take the approach of validating structure (which allows you to then read any MARC format) -- then use a separate process for validating the content of fields, which, in my opinion, is more open to interpretation based on system usage of the data. For example, 22 and 23 are undefined values that local systems may very well have a practical need to define and use, given that there are only so many values in the leader. This is why I sometimes see additional values in the 09 position (which should be 'a' or blank) to define different character set types, or additional elements added to other fields. If I want to validate the content of those fields, I'd validate it through a different process -- but I separate that process from the validation of the structure, because the two are not exclusive.

--TR

-----Original Message-----
From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
Sent: Wednesday, April 06, 2011 9:59 AM
To: Code for Libraries
Cc: Reese, Terry
Subject: Re: [CODE4LIB] MARC magic for file

I'm not sure what you mean, Terry. Maybe we have different understandings of "valid". If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a valid Marc21 file. It violates the Marc21 specification. ...
Re: [CODE4LIB] MARC magic for file
On 6 April 2011, Reese, Terry wrote:

Actually -- I'd disagree, because that is a very narrow view of the specification. When validating MARC, I'd take the approach of validating structure (which allows you to then read any MARC format) -- then use a separate process for validating the content of fields, which, in my opinion, is more open to interpretation based on system usage of the data.

What do you think is the best way to recognize MARC files (up to some level of validity, given all the MARC you've seen and parsed) that could be made to work the way magic is defined?

Bill

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
Re: [CODE4LIB] MARC magic for file
I'm honestly not familiar with magic. I can tell you that in MarcEdit, the way the process works is that there is a very generic function that reads the structure of the data, not trusting the information in the leader (since I find this data very unreliable). Then, if users want to apply a set of rules to the validation, I apply those as a secondary process. If you are looking to validate specific content within a record, then what you want to do in this function may be appropriate -- though you'll find some local systems will consistently fail the process.

--tr

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of William Denton [w...@pobox.com]
Sent: Wednesday, April 06, 2011 10:29 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

What do you think is the best way to recognize MARC files (up to some level of validity, given all the MARC you've seen and parsed) that could be made to work the way magic is defined?

Bill
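(For illustration, a minimal Perl sketch of the kind of leader-agnostic structural check Terry describes -- this assumes ISO 2709 framing and Marc21's 12-byte directory entries, and is not MarcEdit's actual code:)

    use strict;
    use warnings;

    # Leader-agnostic sanity check: trust the record's framing characters,
    # not the leader's claims about itself.
    sub looks_like_marc {
        my ($bytes) = @_;
        return 0 unless length($bytes) >= 24;
        return 0 unless substr($bytes, 0, 5) =~ /^[0-9]{5}$/;  # record length
        return 0 unless index($bytes, "\x1d") >= 0;            # record terminator
        # the directory runs from byte 24 to the first field terminator (0x1e)
        # and, for Marc21, must be a whole number of 12-byte entries
        my $dir_end = index $bytes, "\x1e";
        return 0 if $dir_end < 24;
        return (($dir_end - 24) % 12) == 0 ? 1 : 0;
    }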
[CODE4LIB] Fwd: OAC RFP Announcement
Forwarded:

The Open Annotation Collaboration (OAC) project is pleased to announce a Request For Proposal to collaborate with OAC researchers on building implementations of the OAC data model and ontology. The OAC is seeking to collaborate with scholars and/or librarians currently using and/or curating established repositories of scholarly digital resources with well-defined audiences of scholars. The OAC intends to fund a set of four projects that are complementary in content media type and use cases, that leverage the OAC Data Model to the fullest extent, and that leverage existing annotation tools or at least have articulated an interesting scholarly annotation use case.

Two of the successful Respondents will collaborate with OAC researchers at the University of Maryland, and the other two will collaborate with OAC researchers at the University of Illinois at Urbana-Champaign. (For these collaborations, Illinois and Maryland will provide guidance on the implementation of the OAC data model and ontology, help in defining extensions of the data model that might be necessary, advice on existing tools that might be adaptable for the demonstration experiment, and feedback on the correctness of mappings from/to native annotation formats and/or annotations created.)

The full text of the RFP can be found at http://www.openannotation.org/documents/openAnnotationRFP.pdf

The IP agreement attachment to this RFP is available at: http://www.openannotation.org/documents/openAnnotationIP_Agreement_forRFP.pdf

A FAQ about this RFP is available at: http://www.openannotation.org/RFP_FAQs.html

Please make all submissions regarding this RFP, including your letter of intent and proposal, to oac2...@support.lis.illinois.edu

Questions regarding any details of this RFP should also be emailed to oac2...@support.lis.illinois.edu; answers to substantive questions from individuals will be posted immediately on the RFP FAQ page mentioned above (so as to be available to all proposers).

The Open Annotation Collaboration is supported by a grant from the Andrew W. Mellon Foundation. OAC members include the University of Illinois at Urbana-Champaign, the University of Maryland, the University of Queensland (Australia), and the Los Alamos National Laboratory.

Regards,

Jacob Jett
Assistant Coordinator, Open Annotation Collaboration Project
Center for Informatics Research in Science and Scholarship
The Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
501 E. Daniel Street, MC-493, Champaign, IL 61820-6211 USA
Re: [CODE4LIB] MARC magic for file
Maybe we have different understandings of "valid". If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a valid Marc21 file. It violates the Marc21 specification. ...

There is some question as to what value there is in validating fields that have no meaning by definition. What benefit does validating an undefined value have, other than creating an opportunity to break things and slow the process down just a little? The entire concept of an invalid entry in an undefined field (e.g., byte 23) is oxymoronic.

I'd go so far as to question the value of validating redundant data that theoretically has meaning but which is never supposed to vary. The 4 and the 5 simply repeat what is already known about the structure of the MARC record. Choking on stuff like this is like having a web browser ask you what you want to do with a page because it lacks a document type declaration. Garbage data is the reality, so having parsers stop when they encounter data they don't actually need unnecessarily complicates things. That kind of stuff should generate a warning at worst.

kyle
Re: [CODE4LIB] MARC magic for file
Actually -- I'd disagree, because that is a very narrow view of the specification. When validating MARC, I'd take the approach of validating structure (which allows you to then read any MARC format) -- then use a separate process for validating the content of fields, which, in my opinion, is more open to interpretation based on system usage of the data. ...

Wait, so is there any formal specification of validity that you can look at to determine your definition of validity, or is it just "well, if I can recover it into useful data, using my own algorithms"?

I think we computer programmers are really better served by reserving the notion of validity for things specified by formal specifications -- as we normally do when talking about any other data format. And the only formal specifications I can find for Marc21 say that leader bytes 20-23 should be 4500. (Not true of Marc in general, just Marc21.)

Now it may very well be (is!) true that the library community, with Marc, has been in the practice of tolerating "working" Marc that is NOT valid according to any specification. So, sure, we may need to write software to take account of that sordid history. But I think it IS a sordid history -- not having a specification to ensure validity makes it VERY hard to write any new software that recognizes what you expect it to recognize, because what you expect it to recognize isn't formally specified anywhere. It's a problem. We shouldn't try to hide the problem in our discussions by using the word "valid" to mean something different than we use it for any modern data format. "Valid" only has a meaning when you're talking about valid according to some specific specification.
Re: [CODE4LIB] MARC magic for file
On 4/6/2011 2:02 PM, Kyle Banerjee wrote:

I'd go so far as to question the value of validating redundant data that theoretically has meaning but which is never supposed to vary. The 4 and the 5 simply repeat what is already known about the structure of the MARC record. Choking on stuff like this is like having a web browser ask you what you want to do with a page because it lacks a document type declaration.

Well, the problem is when the original Marc4J author took the spec at its word, and actually _acted upon_ the '4' and the '5', changing file semantics if they were different, and throwing an exception if one was a non-digit. This actually happened, I'm not making this up! Took me a while to debug.

So do you think he got it wrong? How was he supposed to know he got it wrong? He wrote to the spec and took it at its word. Are you SURE there aren't any Marc formats other than Marc21 out there that actually do use these bytes with their intended meaning, instead of fixing them? How was the Marc4J author supposed to be sure of that, or even guess it might be the case, and know he'd be serving users better by ignoring the spec here instead of following it? What documents, instead of the actual specifications, should he have been looking at to determine that he ought not to take those bytes at their word, but just ignore them?

To realize that we have so much non-conformant data out there that we're better off ignoring these bytes is something you can really only learn through experience -- and something you can then later realize you're wrong about too. Ie: I _thought_ I was writing only for Marc21, but then it turns out I've got to accept records from Outer Weirdistan that are a kind of legal Marc that actually uses those bytes for their intended meaning -- better go back and fix my entire software stack, involving various proprietary and open source products from multiple sources, each of which has undocumented behavior when it comes to these bytes; maybe they follow the spec or maybe they follow Kyle's advice, but they don't tell me. This is a mess. Maybe this scenario is impossible, maybe there ARE and NEVER HAVE BEEN any Marc variants that actually use leader bytes 20-22 in this way -- how can I determine that? I've just got to guess and hope for the best.

The point of specifications in the first place is inter-operability, so we know that if all software and data conform to the spec, then all software and data will interact in expected ways. Once we start guessing at which parts of the spec we really ought to be ignoring...

Again, I realize that in the actual environment we've got, this is not a luxury we have. But it's a fault, not a benefit, to have lots of software everywhere behaving in non-compliant ways and creating invalid (according to the spec!) data.
Re: [CODE4LIB] MARC magic for file
On 6 April 2011, Jonathan Rochkind wrote:

I think we computer programmers are really better served by reserving the notion of validity for things specified by formal specifications -- as we normally do when talking about any other data format. And the only formal specifications I can find for Marc21 say that leader bytes 20-23 should be 4500. (Not true of Marc in general, just Marc21.)

Validity does mean something definite ... but Postel's Law is a good guideline, especially with the swamp of bad MARC, old MARC, and alternate MARC that's out there. Valid MARC is valid MARC, but if -- for the sake of file and its magic -- we can identify technically invalid but still usable MARC, that's good.

Bill

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
Re: [CODE4LIB] MARC magic for file
On 4/6/2011 2:43 PM, William Denton wrote:

Validity does mean something definite ... but Postel's Law is a good guideline, especially with the swamp of bad MARC, old MARC, and alternate MARC that's out there. Valid MARC is valid MARC, but if -- for the sake of file and its magic -- we can identify technically invalid but still usable MARC, that's good.

Hmm, except in the case of web browsers, I think the general consensus is that Postel's law was not helpful. These days, most people seem to think that having different browsers be tolerant of invalid data in different ways was actually harmful rather than helpful to inter-operability (which is theoretically the goal of Postel's law), and that's not what people do anymore in web browser land, at least not to the extremes they used to. So Postel's Law may not be a universal. Although marc data may or may not be analogous to a web browser/html. :)

It doesn't _really_ matter, cause we're stuck with the legacy we're stuck with; there's no changing it now. But there are real-world negative consequences to it, some of which I've tried to explain in previous messages. (And still, don't call it "validity" if it's not, please! But yes, sometimes insisting on strict validity is not the appropriate solution.)

Also note that assuming bytes 20-21 are 45 even when they're something else is possibly not something Postel would accept as an application of his law -- unless you document your software specifically as working only with Marc21, and not any Marc.

[Postel's Law: "Be conservative in what you send; be liberal in what you accept." http://en.wikipedia.org/wiki/Robustness_principle . That wiki page also notes the general category of downside in following Postel's law, which is what was encountered with HTML, and which _I've_ encountered with MARC: "For example, a defective implementation that sends non-conforming messages might be used only with implementations that tolerate those deviations from the specification until, possibly several years later, it is connected with a less tolerant application that rejects its messages. In such a situation, identifying the problem is often difficult, and deploying a solution can be costly." Yes, identifying the problem and deploying the solution was costly in my MARC case, although it definitely could have been worse.]
Re: [CODE4LIB] MARC magic for file
Well, the problem is when the original Marc4J author took the spec at its word, and actually _acted upon_ the '4' and the '5', changing file semantics if they were different, and throwing an exception if one was a non-digit.

At least the author actually used the values rather than just checking to see if a 4 or 5 were there. I still don't see what the point of looking for a 0 in an undefined field would be. I'm wondering what kind of nut job would write this into the standard, but that's not the author's problem.

Do you think he got it wrong? How was he supposed to know he got it wrong? He wrote to the spec and took it at its word. Are you SURE there aren't any Marc formats other than Marc21 out there that actually do use these bytes with their intended meaning, instead of fixing them?

I wouldn't call it wrong -- the spec is a logical point of departure. MARC21 derives from an ISO standard that does not use those character positions and which otherwise requires the same data layout, but the author wouldn't necessarily know that. Standards have something in common with laws in that how they are used in the real world is as or more important than what is actually defined -- what's written and what's done in practice can be very different. Everyone here who has parsed catalog data or has done an ILS migration knows better than to think for a second that fields can be assumed to be used as defined, except for very basic stuff.

How was the Marc4J author supposed to be sure of that, or even guess it might be the case, and know he'd be serving users better by ignoring the spec here instead of following it?

There might not have been a good way to know. With data, one thing you always want to do is ask a bunch of people who work with it all the time about anomalies in the wild. Many great works of fiction masquerade as documents which supposedly describe reality.

Ie: I _thought_ I was writing only for Marc21, but then it turns out I've got to accept records from Outer Weirdistan that are a kind of legal Marc that actually uses those bytes for their intended meaning

Any such MARC would be noncompliant with the ISO standard from which MARC21 hails. If working from the MARC21 standard, and weird records are in question, there would be a greater chance of choking on non-numeric tags, as those are allowed by the ISO standard. Ignoring that MARC21 would need to be redefined to be able to take on other values, one can safely conclude that such a redefinition could only be written by totally deranged individuals. Values lower than 4 and 5, respectively, would limit field length to the point that little or no data could be stored, and greater values would be completely nonsensical, as the MARC record length limitation would mean that the extra space allocated by the digits could only contain zeros.

In any case, MARC is a legacy standard from the '60s. The chances of new flavors emerging are dismal at best.

Again, I realize that in the actual environment we've got, this is not a luxury we have. But it's a fault, not a benefit, to have lots of software everywhere behaving in non-compliant ways and creating invalid (according to the spec!) data.

Creating is another matter entirely. Since we can control what we create ourselves, we make things a little better every time we make things conformant. However, we can't control what others do, and being able to read everything is useful, including stuff created using tools/processes that aren't up to scratch.

kyle
[CODE4LIB] utf8 \xC2 does not map to Unicode
Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record:

  utf8 "\xC2" does not map to Unicode

This is a real pain, and I'm hoping someone here can help me either: 1) trap this error, allowing me to move on, or 2) figure out how to open the file correctly.

--
Eric Morgan
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
I am not familiar with that Perl module. But I'm more familiar than I'd want to be with char encoding in Marc. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past debugging, but I've forgotten 'em), but the first things to look at:

1. Is your Marc file encoded in Marc8 or UTF-8? I'm betting Marc8. Theoretically there is a Marc leader byte that tells you whether it's Marc8 or UTF-8, but the leader byte is often wrong in real-world records. Is it wrong?

2. Does Perl MARC::Batch have a function to convert from Marc8 to UTF-8? If so, how does it decide whether to convert? Is it trying to do that? Is it assuming that the leader byte of the record accurately identifies the encoding, and if so, is the leader byte wrong? Is it trying to convert from Marc8 to UTF-8 when the source was UTF-8 in the first place? Or is it assuming the source was UTF-8 in the first place, when in fact it was Marc8?

Not the answer you wanted; maybe someone else will have that. Debugging char encoding is hands down the most annoying kind of debugging I ever do.

On 4/6/2011 4:13 PM, Eric Lease Morgan wrote:

Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record: utf8 \xC2 does not map to Unicode ...
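(A sketch of the trapping half, for what it's worth: strict_off and warnings_off are real MARC::Batch methods, though whether they actually get you past a record whose bytes won't decode depends on where the error is raised.)

    use strict;
    use warnings;
    use MARC::Batch;

    my $batch = MARC::Batch->new('USMARC', 'tor.marc');
    $batch->strict_off;     # don't die on records with errors
    $batch->warnings_off;   # don't print the warnings either

    while (my $record = eval { $batch->next }) {
        # leader byte 9: 'a' claims UTF-8, blank claims MARC-8 -- often a lie
        my $claims = substr($record->leader, 9, 1) eq 'a' ? 'UTF-8' : 'MARC-8';
        printf "%s (leader claims %s)\n", $record->title // '[no 245]', $claims;
    }
    warn "stopped early: $@" if $@;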
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
Can you share the record somewhere? I suspect many of us have tools we can turn loose on it.

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Wednesday, April 06, 2011 4:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

I am not familiar with that Perl module. But I'm more familiar than I'd want to be with char encoding in Marc. ...
[CODE4LIB] **SKOS-2-HIVE: CREATING SKOS VOCABULARIES TO HELP INTERDISCIPLINARY VOCABULARY ENGINEERING**
Forwarding because I think this will be of interest to some folks on the list...

---------- Forwarded message ----------

SKOS-2-HIVE: CREATING SKOS VOCABULARIES TO HELP INTERDISCIPLINARY VOCABULARY ENGINEERING

We are pleased to announce the addition of more HIVE workshops!

DATES AND LOCATIONS

* April 29, 2011 -- University of North Texas, Denton, Texas; registration deadline: April 20th. Register for the Texas workshop: http://tinyurl.com/4f39ye6

* May 20, 2011 -- Columbia University, New York City; registration deadline: May 10th. Register for the New York workshop: http://tinyurl.com/4fdode9

* California-based workshop date to be determined! If your institution is interested in hosting a workshop, please contact: hive.workshop2...@gmail.com

WORKSHOP DESCRIPTION

SKOS-2-HIVE workshops focus on using semantic web technologies for representing and describing collections using multiple controlled vocabularies. The workshop focuses on basic understanding and usage of W3C's Simple Knowledge Organization Systems (SKOS, http://www.w3.org/2004/02/skos/), linked data, and the HIVE library of open source applications. There are two workshop components:

1. Foundational Concepts and HIVE Basics. This component addresses the conceptual design of structured vocabularies, including a range of semantic relationships; domain representation and issues central to identifying useful vocabularies; the application of basic SKOS tags; and basic techniques underlying the HIVE vocabulary server for enriching digital resource descriptions.

2. Implementing HIVE. This component covers more technical aspects, including steps for implementing a HIVE server.

Workshop outlines and learning outcomes are provided further below.

Workshop rationale: Semantic web technologies provide innovative means for organizing, describing, and managing digital resources in a range of formats. Successful implementation and use of semantic web technologies requires both information professionals and system developers to become knowledgeable about the underlying intellectual construct and roadmap toward forming a semantic web. The IMLS-funded Helping Interdisciplinary Vocabulary Engineering (HIVE) project (https://www.nescent.org/sites/hive/Main_Page) has been addressing these needs by working with the W3C's Simple Knowledge Organization Systems (SKOS, http://www.w3.org/TR/skos-reference/) in the linked data environment. HIVE has been implemented using semantic web enabling technologies and machine learning to provide a solution to the traditional controlled vocabulary problems of cost, interoperability, and usability. Current HIVE vocabulary partners include the Library of Congress (http://www.loc.gov/index.html), the Getty Research Institute (http://www.getty.edu/research/), and the U.S. Geological Survey (http://www.usgs.gov/).

WORKSHOP OUTLINE AND LEARNING OUTCOMES

Morning Session: Foundational Concepts and HIVE Basics, 8:30 AM-12:00 PM

Overview: This session addresses traditional thesaural concepts and the extension of these concepts via SKOS/linked data, HIVE, and the semantic web.

Audience: This workshop targets information professionals (librarians, archivists, museum professionals, web architects, and others); system developers; and students seeking knowledge about the basic framework and conceptual aspects of vocabulary design.

Prerequisites: A basic understanding of subject metadata creation or subject cataloging.

Learning Outcomes:
- Evaluate controlled vocabularies, thesauri, and ontologies that would best fit your information environment's needs.
- Identify basic thesaural relationships, including: relative, associative, and hierarchical.
- Use basic SKOS tags to identify the above thesaural relationships.
- Become familiar with using the HIVE software and the HIVE processes.

Lunch on your own, 12:00 PM-1:00 PM

Afternoon Session: Implementing HIVE, 1:00 PM-4:30 PM

Overview: This session provides details on the HIVE system, underlying algorithms, source code, and the library of system features.

Audience: System developers, as well as technologists, librarians, and information scientists who are interested in the technological side of the semantic web, and who may be implementing, experimenting with, and/or extending HIVE technologies.

Prerequisites: Java programming and object-oriented design.

Learning Outcomes:
- Understand the architecture of the HIVE vocabulary server.
- Become familiar with information retrieval techniques and how HIVE applies them to vocabulary terms.
- Gain experience indexing documents with HIVE and KEA (a machine learning application).
- Learn how to integrate HIVE vocabulary services into other tools.
- Learn how to use the SPARQL language for querying content in HIVE.

Register for the Texas workshop: http://tinyurl.com/4f39ye6
Register for the New York workshop: http://tinyurl.com/4fdode9

Registration Fees
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
On Apr 6, 2011, at 4:46 PM, LeVan, Ralph wrote:

Can you share the record somewhere? I suspect many of us have tools we can turn loose on it.

Sure, thanks. Try:

http://zoia.library.nd.edu/tmp/tor.marc

--
Eric Lease Morgan
Re: [CODE4LIB] MARC magic for file
On 6 April 2011 19:53, Jonathan Rochkind rochk...@jhu.edu wrote:

Hmm, except in the case of web browsers, I think the general consensus is that Postel's law was not helpful. These days, most people seem to think that having different browsers be tolerant of invalid data in different ways was actually harmful rather than helpful to inter-operability (which is theoretically the goal of Postel's law), and that's not what people do anymore in web browser land, at least not to the extremes they used to.

But the idea that browsers should be less permissive in what they accept is a modern one that we now have the luxury of only because adherence to Postel's law in the early days of the Web allowed it to become ubiquitous. Though it's true, as Harvey Thompson has observed, that it's difficult to retro-fit correctness, Clay Shirky was also very right when he pointed out that "You cannot simultaneously have mass adoption and rigor." If browsers in 1995 had been as pedantic as the browsers of 2011 (rightly) are, we wouldn't even have the Web; or if it existed at all it would just be a nichey thing that a few scientists used to make their publications available to each other.

So while I agree that in the case of HTML we are right to now be moving towards more rigorous demands of what to accept (as well, of course, as being conservative in what we emit), I don't think we could have made the leap from nothing to modern rigour.

-- Mike
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in MARC-8. I'd guess the file isn't in UTF8.

--TR

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind
Sent: Wednesday, April 06, 2011 1:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

I am not familiar with that Perl module. But I'm more familiar than I'd want to be with char encoding in Marc. ...
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
Lol! So right off the bat I see that the leader says the record is 1091 bytes long, but it is actually 1089 bytes long, and I end up missing the leader for the next record.

Maybe a CR/LF problem? I see that frequently as a way to mangle MARC records when moving them around. Is your problem in the very first record?

Ralph

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: Wednesday, April 06, 2011 4:55 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

Sure, thanks. Try: http://zoia.library.nd.edu/tmp/tor.marc ...
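(A quick way to check whether that off-by-two is systematic across the whole file -- a sketch that reads raw bytes so no encoding layer gets in the way:)

    use strict;
    use warnings;

    # Compare each record's stated length (leader bytes 0-4) with the
    # distance to its record terminator (0x1D). A constant difference
    # across all records smells like CR/LF mangling in transit.
    open my $fh, '<:raw', 'tor.marc' or die "tor.marc: $!";
    local $/ = "\x1d";          # read one ISO 2709 record at a time
    my $n = 0;
    while (my $chunk = <$fh>) {
        $n++;
        my $stated = substr $chunk, 0, 5;
        my $actual = length $chunk;    # includes the terminator itself
        print "record $n: leader says $stated, actually $actual\n"
            if $stated =~ /^[0-9]{5}$/ && $stated != $actual;
    }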
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
That's hilarious, that Terry has had to do enough ugliness with Marc encodings that he can indeed recognize 0xC2 off the bat as the Marc8 encoding it represents! I am in awe, as well as sympathy.

If the record is in Marc8, then you need to know if Perl MARC::Batch can handle Marc8. If it's supposed to be able to handle it, you need to figure out why it's not. (Leader byte says UTF-8 even though it's really Marc8?) If MARC::Batch can't handle Marc8, you need to convert to UTF-8 first. The only software package I know of that can convert from and to Marc8 encoding is Java Marc4J, but I wouldn't be shocked if there was something in Perl to do it. (But yes, as you can tell by the name, Marc8 is a character encoding ONLY used in Marc; nobody but library people write software for dealing with it.)

On 4/6/2011 5:01 PM, Reese, Terry wrote:

I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in MARC-8. I'd guess the file isn't in UTF8. ...
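(There is something in Perl: MARC::Charset. A minimal sketch, assuming the input bytes really are MARC-8:)

    use strict;
    use warnings;
    use MARC::Charset qw(marc8_to_utf8);

    binmode STDOUT, ':utf8';    # we're about to print non-ASCII

    # MARC-8 bytes in, Perl character string out. Garbage in, garbage out:
    # if the source was already UTF-8, this will mangle it instead.
    my $marc8 = "sound recording: \xC2";   # MARC-8 0xC2 = U+2117, per Terry
    print marc8_to_utf8($marc8), "\n";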
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
I'm not quite convinced that it's marc-8 just because there's \xC2 ;). If you look at a hex dump, I'm seeing a lot of what might be combining characters. The leader appears to have 'a' in the field to indicate unicode. In the raw hex I'm seeing a lot of two-character sequences like: 756c 69c3 83c2 a872 (culir). If I knew my utf-8 better, I could guess what combining diacritics these are. Doing a lookup on http://www.fileformat.info seems to indicate that this might be utf-8, a 'DIAERESIS'.

When debugging any encoding issue it's always good to know: a) how the records were obtained; b) how they have been manipulated before you touch them (basically, how many times may they have been converted by some bungling process?); c) what encoding they claim to be now; and d) what encoding they are, if any.

It's been a while since I used MARC::Batch. Is there any reason you're using that instead of just using MARC::Record? I'd try just creating a MARC::Record object. (I've seen people do really bizarre things to break MARC files, such as editing the raw binary, thus invalidating the leader and the directory as the byte counts were no longer right.)

I hate to say it, but we still come across files that are no longer in any encoding due to too many bad conversions. It's possible these are as well. The enca tool (haven't used it much) guesses this as utf-8 mixed w/ non-text data.

Jon
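(If double conversion is the suspicion, one rough heuristic -- my sketch, not a library function -- is to decode once and see whether the result, viewed as Latin-1 bytes, still contains valid-looking UTF-8 multi-byte sequences. Bytes like C3 83 C2 A8 are what you get when already-UTF-8 text, here "e with grave" as C3 A8, is run through a Latin-1-to-UTF-8 conversion a second time.)

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Returns 1 if $bytes look like UTF-8 that was encoded twice.
    sub smells_double_encoded {
        my ($bytes) = @_;
        my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
        return 0 unless defined $chars;      # not even valid UTF-8 once
        my $latin1 = eval { encode('ISO-8859-1', $chars, Encode::FB_CROAK) };
        return 0 unless defined $latin1;     # chars beyond Latin-1: single-encoded
        # a surviving lead byte plus continuation byte suggests a second layer
        return $latin1 =~ /[\xC2-\xF4][\x80-\xBF]/ ? 1 : 0;
    }

    # the exact byte run Jon quotes from the hex dump
    print smells_double_encoded("\x75\x6c\x69\xc3\x83\xc2\xa8\x72")
        ? "looks double-encoded\n" : "looks single-encoded\n";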
Re: [CODE4LIB] utf8 \xC2 does not map to Unicode
On 6 April 2011, Eric Lease Morgan wrote:

http://zoia.library.nd.edu/tmp/tor.marc

Happily, Kevin's magic formula recognizes this as MARC!

Bill

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org