RE: [librecat-dev] A common MARC record path language
> You could also consider grokking Jason Thomale's "Interpreting MARC: Where's the Bibliographic Data?" http://journal.code4lib.org/articles/3832

That's a very good article, as it highlights the problems of the prescribed punctuation: it gets in the way of extracting parts of the data, yet it also provides extra context to the subfields.

> It is not a MOM (MARC Object Model), or rather an object model for any format derived from ISO 2709 and its concepts of files, records, (flavors of) fields and subfields, and therefore no abstract API can be specified (prescribing that some operation X is defined on record objects and yields field objects).

If we are just talking about ISO 2709, the whole family of MARC formats in general, then you have to remember that UNIMARC and obsolete formats like UKMARC have very different requirements. UKMARC and UNIMARC are actually much easier to work with than MARC21 because the ISBD punctuation is not carried in the record but is generated from the subfield tags. So you don't have to say "give me the 245 $a and $b but strip / off the end if present", because the slash is not there. And there is a different subfield tag to introduce a parallel title, so you don't need to distinguish :$b from =$b.

In the UK most libraries have been on MARC21 for a decade or more now. I don't know how much use is still made of UNIMARC or the other national formats, nor how good they were. It seems as though in the last twenty years many countries have moved towards MARC21 because of the sheer number of records available in that format. It's just a pity that it's possibly the worst of the ISO 2709 formats to work with if you want to repurpose the data!

I hope that BIBFRAME is not going to make the same mistakes. I have not been following that initiative in detail, but I've seen a few examples of data with punctuation hanging about at the end. It is hard to tell whether it's prescribed punctuation or copied from the book.
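The MARC21 clean-up described above ("strip / off the end if present", and telling :$b from =$b) could be sketched roughly as follows. This is only an illustration in Python, not code from any library mentioned in this thread: the (code, value) subfield layout, the function name interpret_245 and the role labels are all assumptions, and the sample data in the usage note is made up.

```python
import re

# Trailing ISBD mark that MARC21 records conventionally carry at the end of a
# 245 subfield (assumption: AACR2/ISBD-style punctuation is in the record).
TRAILING_ISBD = re.compile(r"\s*([/:=])\s*$")

# The mark at the end of the *preceding* subfield tells us what $b means.
B_ROLES = {":": "other_title_info", "=": "parallel_title"}


def interpret_245(subfields):
    """Strip trailing ISBD marks and label $a, $b and $c.

    subfields: list of (code, value) pairs for one 245 field
    (a hypothetical layout chosen for this sketch).
    Other subfields ($h, $n, $p, ...) are skipped, but their trailing
    mark is still remembered for the subfield that follows.
    """
    parts = []
    previous_mark = None
    for code, value in subfields:
        match = TRAILING_ISBD.search(value)
        text = value[:match.start()] if match else value
        if code == "a":
            parts.append(("title", text))
        elif code == "b":
            parts.append((B_ROLES.get(previous_mark, "unknown"), text))
        elif code == "c":
            parts.append(("responsibility", text))
        previous_mark = match.group(1) if match else None
    return parts
```

Called with [("a", "Main title ="), ("b", "Parallel title /"), ("c", "by somebody.")], the sketch labels $b as a parallel title and drops the trailing = and /. It deliberately ignores harder cases, such as a $b that contains several " : " or " = " segments, or a transcribed title that itself ends in one of these characters.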
The title field, in particular, is much more akin to HTML markup than to data fields in a database. In antiquarian cataloguing rules like DCRM, the emphasis is on exact transcription from the title page, where the presence or absence of punctuation can make a difference in identifying variant editions. In MARC21 we get the crazy situation where the cataloguers transcribe the exact punctuation from the title page and *add* the ISBD punctuation to the MARC21 record. This makes it very hard to present the lay-person with anything meaningful.

Matthew

-- 
Matthew Phillips
Head of Digital and Bibliographic Services, Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
Re: [librecat-dev] A common MARC record path language
On 25.02.2014 12:50, PHILLIPS M.E. wrote:

> If we are just talking about ISO 2709, the whole family of MARC formats in general, then you have to remember that UNIMARC and obsolete formats like UKMARC have very different requirements. UKMARC and UNIMARC are actually much easier to work with than MARC21 because the ISBD punctuation is not carried in the record but is generated from the subfield tags. So you don't have to say "give me the 245 $a and $b but strip / off the end if present", because the slash is not there.

The same thing holds for MARC21: the punctuation regime for the record is governed by Leader position 18 (descriptive cataloging form), which currently gives the choice between mainly AACR2, ISBD with punctuation, and ISBD without punctuation (and does not yet have code(s) for RDA).

Here in Germany there is a strong tradition that cataloguers shall not enter punctuation when the field granularity of the underlying database allows its automatic generation for display or for conversion to other formats (what I mean is: punctuation is generated when converting from the internal format to MARC in cases where MARC is not as granular as the internal format). This applies to RAK data in the union databases and its transport via MAB2 or MARC21, and the intention is to carry this on when switching from RAK to RDA.

[There has also been the regulation for the D-A-CH application layer to move punctuation which cannot be eliminated to the start of the subfield it belongs to, e.g.

245 $a title = $b parallel title

becomes

245 $a title $b = parallel title

probably on the prospect that this could ease processing...]
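To make the two points above concrete, here is a small, hedged Python sketch: one helper looks at Leader position 18, the other applies the D-A-CH style shift of a trailing mark to the start of the following subfield. The function names and the (code, value) subfield layout are hypothetical, and counting code "a" (AACR2) as "punctuation is in the record" is an assumption of this sketch rather than something the format documentation spells out here.

```python
import re

# Leader/18 codes named in the message: "a" (AACR2) and "i" (ISBD punctuation
# included) are treated as "punctuation is in the record"; "c" means ISBD
# punctuation omitted. Treating "a" this way is an assumption.
PUNCTUATION_PRESENT = {"a", "i"}

# A trailing ISBD mark, preceded by whitespace, at the very end of a subfield.
TRAILING_MARK = re.compile(r"\s+([/:=;])$")


def has_isbd_punctuation(leader):
    """Return True if Leader position 18 signals punctuation in the record."""
    return leader[18] in PUNCTUATION_PRESENT


def move_marks_to_start(subfields):
    """Move a trailing ISBD mark to the start of the following subfield.

    subfields: list of (code, value) pairs (hypothetical layout).
    A mark at the end of the last subfield has no successor and stays put.
    """
    moved = []
    carried = ""
    last = len(subfields) - 1
    for index, (code, value) in enumerate(subfields):
        match = TRAILING_MARK.search(value) if index < last else None
        text = value[:match.start()] if match else value
        moved.append((code, f"{carried} {text}" if carried else text))
        carried = match.group(1) if match else ""
    return moved
```

With the example from the message, move_marks_to_start([("a", "title ="), ("b", "parallel title")]) yields [("a", "title"), ("b", "= parallel title")].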
Best regards,
Thomas Berger
Re: [librecat-dev] A common MARC record path language
Carsten,

> Thank you both for bringing the discussion forward. I must admit that I'm having some problems following here. I read your mails multiple times, really trying to understand your demands. After reading this [1], I hope I'm getting closer.

You could also consider grokking Jason Thomale's "Interpreting MARC: Where's the Bibliographic Data?" http://journal.code4lib.org/articles/3832 (preceding Karen Coyle's more widely known article "MARC21 as Data: A Start" http://journal.code4lib.org/articles/5468 ).

> I just want to sum up what I think I've understood so far. Please correct me if I'm wrong.
>
> -- When it comes to cataloging-based delimiters (punctuation), there is some inner semantics to the content of the subfields. E.g. =$b in field 245 means something different than :$b.

Yes and no: In your example

787 08 $ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901. $tOtello.$d Milano: Ricordi, c1913

three of the four subfields have internal structure which is likely to be exploited, as in:

- display $ireproduction of (manifestation) without the text in parentheses, as the left column in a table or styled differently (introductory phrase in italics and/or followed by a colon)
- display $aVerdi, Giuseppe, 1813-1901. as Verdi, Giuseppe, (1813-1901), succeeded by a colon if $t is next
- display the title $tOtello. in italics, index it somewhere
- extract the place Milano from $dMilano: Ricordi, c1913 (the part before the ":")
- display copyright signs more nicely than "c" (applies to 787$d).

245$b is only a notorious example where a subfield not only combines several concepts, as in 787$d, but where there is no fixed first one, and therefore its meaning has to be deduced from punctuation information that is unfortunately (but as usual) not in the subfield itself but immediately preceding it.

Furthermore, the ensemble $a+$t+$d constitutes a unit (*one* citation) which in many cases should not be torn apart.
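The kind of "internal structure" exploitation listed above for the 787 example might look like the following Python sketch. All four helper names are made up for illustration, and the regular expressions only cover the patterns visible in this one example (a trailing parenthetical in $i, a trailing date range in $a, "place: publisher" in $d, and "c" directly before a four-digit year); real data will need more care.

```python
import re


def label_from_relationship(i_value):
    """Drop a trailing parenthetical, e.g. '(manifestation)', from 787 $i."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", i_value)


def heading_with_dates(a_value):
    """Put a trailing 'yyyy-yyyy.' date range of 787 $a into parentheses."""
    match = re.match(r"^(.*?,)\s*(\d{4}-\d{0,4})\.?$", a_value.strip())
    if not match:
        return a_value.strip()
    return f"{match.group(1)} ({match.group(2)})"


def place_from_imprint(d_value):
    """Return the text before the first ':' of 787 $d, or None if absent."""
    place, separator, _rest = d_value.partition(":")
    return place.strip() if separator else None


def pretty_copyright(d_value):
    """Replace a 'c' that directly precedes a four-digit year by a copyright sign."""
    return re.sub(r"\bc(?=\d{4}\b)", "\u00a9", d_value)
```

Applied to the subfields of the example, these helpers give "reproduction of", "Verdi, Giuseppe, (1813-1901)", "Milano" and "Milano: Ricordi, ©1913" respectively. Each helper works on a single subfield value; nothing here keeps $a, $t and $d together as one citation, which the message above argues is often what a consumer actually needs.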
[There's also the case of 100$c as a kind of unspecific container for any of the several different classes of information to be injected into the heading according to AACR2 or RDA: professions, bynames, indications of rank etc. But there is no non-MARC markup except , and it's almost impossible to reverse engineer $c to the factual information ("Spanish king") underlying the heading.]

> -- There may be data you want to get as a whole, which is spread over multiple subfields. This information cannot be described by a range of subfields, but only by the closure through punctuation. E.g. in the field
>
> 245 00$aHeritage Books archives.$pUnderwood biographical dictionary.$nVolumes 1 2 revised$h[electronic resource] /$cLaverne Galeener-Moore.
>
> the data you want to get is
>
> Heritage Books archives. Underwood biographical dictionary. Volumes 1 2 revised [electronic resource]

I think 245 is one of the many cases where specific information can be /deduced/ from (MARC and ISBD) markup in the field, but it would be dangerous to state that e.g. 245$h /contains data/. It is tempting to speak or think in terms of subfield content, i.e. something data-like which is implicitly terminated by the next subfield mark. (The / actually does not belong to $h when attempting to view it as data; it's just an indication that the next subfield mark to follow will probably be $c.) Thus 245 is, in XML lingo, mixed content with most of the prescribed punctuation /outside/ the children data elements. As usual, MARCspec also cannot boldly declare that the permissible results should be regarded as "the text" or "the data": both views are legitimate and have to be taken into account.

To achieve the string you just gave is either trivial (prevalent AACR2 practice with ISBD punctuation always provided in the record: fetch the field and substitute $.? by a single space) or involves much magic (coming D-A-CH practice with ISBD punctuation generally not provided: fetch the field, analyze the subfield marks and enhance it with proper ISBD punctuation).

[O.k., I see: you either stripped $c from it, or the content after /, or the specific constellation of the trailing / immediately preceding $anything or specifically $c. ISBD knows about a parallel statement of responsibility like in "Our Mission / by Corporate Body A = Notre Commande / par Corporation A", but I don't know offhand how this is coded in AACR2+MARC for current examples.]

And, as I'm not a typewriter, I would rather process the content of 245 with the help of the semantic clues given by the MARC encoding. Something with only . as remaining delimiters is not much help. (And retrieving more refined components like $a, $b etc. afterwards and matching them to specific parts of the combined string above seems to be very much work, comparable to automatic tagging of OCR results.)

> Is this what you mean when you want to say something like "Get me all from field XXX until you hit Y"?

I guess so. As I understand the purpose of MARCspec it is kind of hit and run: It is not a MOM (MARC Object
RE: [librecat-dev] A common MARC record path language
]. If there is any other approach you can think of, please make a proposal or give me a substantial discussion here. Otherwise I can't see any options for solving this problem in MARCspec.

Cheers!
Carsten

___
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz
Fon: +49 30 266-43 44 02

-----Original Message-----
From: Thomas Berger [mailto:t...@gymel.com]
Sent: Wednesday, 19 February 2014 01:04
To: Klee, Carsten; 'Patrick Hochstenbach'
Cc: v...@gbv.de; librecat-...@mail.librecat.org; perl4lib@perl.org
Subject: Re: [librecat-dev] A common MARC record path language

On 18.02.2014 17:47, Klee, Carsten wrote:

> I understand that there is MARC data combined with cataloging rules. We don't use this approach within our MARC. So I'm not really aware of the problems.

Your MARC, however, will be very much interested in / (or =) as the first character of some subfield in 245, if I recall correctly. Not such a big difference, I would think. But maybe a slight complication of the matter, since MARCspec would have to cope with both approaches...

Thomas Berger