RE: [librecat-dev] A common MARC record path language

2014-02-25 Thread PHILLIPS M.E.
 You could also try to grok Jason Thomale's "Interpreting MARC:
 Where's the Bibliographic Data?" http://journal.code4lib.org/articles/3832

That's a very good article, as it highlights the problems of the prescribed 
punctuation both getting in the way of extracting parts of the data and its 
role in providing extra context to the subfields.

 It is not a MOM (MARC Object Model) or rather an object model for
 any format derived from ISO 2709 and its concepts of files, records,
 (flavors of) fields and subfields and therefore no abstract API
 can be specified (prescribing that some operation X is defined on
 record objects and yields field objects).

If we are just talking about ISO 2709, the whole family of MARC formats in 
general, then you have to remember that UNIMARC and obsolete formats like 
UKMARC have very different requirements.  UKMARC and UNIMARC are actually much 
easier to work with than MARC21 because the ISBD punctuation is not carried in 
the record but is generated from the subfield tags.  So you don't have to say 
"give me the 245 $a and $b, but strip the "/" off the end if present", because 
the slash is not there.  And there is a different subfield tag to introduce a 
parallel title, so you don't need to distinguish ":$b" from "=$b".
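
As a rough illustration of the MARC21 side of this, here is a minimal Python sketch of the "strip the slash if present" step. The helper name strip_isbd, the list of separators and the sample field are made up for this example and do not come from any MARC library.

```python
import re

# Hypothetical helper: trailing ISBD separators that a MARC21 record
# may carry at the end of a 245 subfield. The list is an assumption.
TRAILING_ISBD = re.compile(r"\s*[/:;=,]\s*$")

def strip_isbd(value: str) -> str:
    """Remove one trailing ISBD separator and the spaces around it."""
    return TRAILING_ISBD.sub("", value)

# Subfields of a made-up MARC21 245 field as (code, value) pairs.
field_245 = [("a", "Our mission :"), ("b", "a history /"), ("c", "by A. Author.")]

# "Give me the 245 $a and $b, but strip the separator off the end."
title_parts = [strip_isbd(value) for code, value in field_245 if code in ("a", "b")]
print(title_parts)
```

A title that genuinely ends in one of these characters would lose it as well, which is exactly the ambiguity between transcribed and prescribed punctuation.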

In the UK most libraries have been MARC21 for a decade or more now.  I don't 
know how much use is still made of UNIMARC, or the other national formats, nor 
how good they were.  It seems as though in the last twenty years many countries 
have made moves towards MARC21 because of the sheer numbers of records 
available in that format.  It's just a pity that it's possibly the worst of the 
ISO 2709 formats to work with if you want to repurpose the data!

I hope that BIBFRAME is not going to make the same mistakes.  I have not been 
following that initiative in detail, but I've seen a few examples of data with 
punctuation hanging about at the end.  Hard to tell whether it's prescribed 
punctuation or copying from the book.

The title field, in particular, is much more akin to HTML markup than data 
fields in a database.  In antiquarian cataloguing rules like DCRM, the emphasis 
is on exact transcription from the title page, where the presence or absence of 
punctuation can make a difference in identifying variant editions.  In MARC21 
we get the crazy situation where the cataloguers transcribe the exact 
punctuation from the title page and *add* the ISBD punctuation to the MARC21 
record.  This makes it very hard to present the lay-person with anything 
meaningful.

Matthew

-- 
Matthew Phillips
Head of Digital and Bibliographic Services,
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941



Re: [librecat-dev] A common MARC record path language

2014-02-25 Thread Thomas Berger

On 25.02.2014 12:50, PHILLIPS M.E. wrote:

 If we are just talking about ISO 2709, the whole family of MARC formats in
 general, then you have to remember that UNIMARC and obsolete formats like
 UKMARC have very different requirements. UKMARC and UNIMARC are actually much
 easier to work with than MARC21 because the ISBD punctuation is not carried in
 the record but is generated from the subfield tags. So you don't have to say
 give me the 245 $a and $b but strip / off the end if present because the
 slash is not there.

Same thing with MARC21: the punctuation regime for the record is governed by
Leader position 18 (descriptive cataloging form), which currently gives the
choice mainly between AACR2, ISBD with punctuation and ISBD without punctuation
(there are no codes for RDA yet).
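
A small Python sketch of reading that position, assuming the leader is available as a 24-character string. The function name is hypothetical, and the table lists only the three values mentioned above; anything else is reported as "other or unknown" rather than guessed.

```python
# Only the three values named in this message; check the MARC21
# documentation for the full and current list before relying on it.
DESCRIPTIVE_FORM = {
    "a": "AACR2",
    "i": "ISBD punctuation included",
    "c": "ISBD punctuation omitted",
}

def punctuation_regime(leader: str) -> str:
    """Look up the descriptive cataloging form at Leader/18 (0-based)."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is 24 characters long")
    return DESCRIPTIVE_FORM.get(leader[18], "other or unknown")

# Made-up leaders that differ only in position 18.
print(punctuation_regime("00000nam a2200000 a 4500"))
print(punctuation_regime("00000nam a2200000 c 4500"))
```

A path language could consult this once per record and decide whether punctuation has to be stripped or generated.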

Here in Germany there is a strong tradition that cataloguers shall not enter
punctuation when the field granularity of the underlying database allows its
automatic generation for display or conversion to other formats
(what I mean is: punctuation is generated when converting from the internal
format to MARC in cases where MARC is not as granular as the internal format).

This applies to RAK data in the union databases and its transport via MAB2 or
MARC21 and it is also the intention to carry this on when switching from RAK
to RDA.
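
To illustrate what "generated for display" can mean in the simplest case, here is a hedged Python sketch. The prefix table and the sample data are assumptions made for this example; it treats every $b as other title information, which a real conversion could not do.

```python
# Assumed ISBD prefixes for a few 245 subfields; real rules are richer.
ISBD_PREFIX_245 = {"b": " : ", "c": " / "}

def render_245(subfields):
    """Build a display string, generating ISBD marks from subfield codes."""
    out = ""
    for index, (code, value) in enumerate(subfields):
        if index == 0:
            out = value
        else:
            # Unknown codes are joined with a plain space.
            out += ISBD_PREFIX_245.get(code, " ") + value
    return out

# Made-up field stored without any prescribed punctuation.
stored = [("a", "Otello"), ("b", "dramma lirico"), ("c", "Giuseppe Verdi")]
print(render_245(stored))
```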

[There's also been the regulation for the D-A-CH application layer to move
punctuation which cannot be eliminated to the start of the subfield it
belongs to, e.g.

245 $a title = $b parallel title

becomes

245 $a title $b = parallel title

probably on the prospect that this could ease processing...]
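
The rewrite shown above can be sketched in Python as follows. The function name and the set of movable marks are assumptions for illustration; the actual D-A-CH rule may cover other cases.

```python
import re

# Separators assumed to be movable (an assumption, not the official list).
TRAILING_MARK = re.compile(r"\s*([=:/;])\s*$")

def move_punctuation_forward(subfields):
    """Move a trailing separator to the start of the following subfield."""
    result = []
    carried = ""
    for code, value in subfields:
        if carried:
            value = carried + " " + value
            carried = ""
        match = TRAILING_MARK.search(value)
        if match:
            carried = match.group(1)
            value = value[:match.start()]
        result.append((code, value))
    if carried and result:
        # No subfield follows, so put the mark back at the end.
        code, value = result[-1]
        result[-1] = (code, value + " " + carried)
    return result

print(move_punctuation_forward([("a", "title ="), ("b", "parallel title")]))
```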

Best regards
Thomas Berger


Re: [librecat-dev] A common MARC record path language

2014-02-24 Thread Thomas Berger
Carsten,


 Thank you both for bringing the discussion forward. I must admit that I'm
 having some problems following here. I read your mails multiple times, really
 trying to understand your demands. After reading this [1], I hope I'm getting
 closer.

You could also try to grok Jason Thomale's "Interpreting MARC: Where's the
Bibliographic Data?" http://journal.code4lib.org/articles/3832 (preceding
Karen Coyle's more widely known article "MARC21 as Data: A Start"
http://journal.code4lib.org/articles/5468 ).


 I just want to sum up what I think I've understood so far. Please correct me
 if I'm wrong.
 
 -- When it comes to cataloging-based delimiters (punctuation), there are some
 inner semantics to the content of the subfields. E.g. "=$b" in field 245 means
 something different from ":$b".

Yes and no: In your example

787  08 $ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901.
$tOtello.$d Milano: Ricordi, c1913

three of the four subfields have internal structure which is likely to
be exploited, as in:

display $i "reproduction of (manifestation)" without the text in parentheses,
as the left column in a table or styled differently (introductory phrase
in italics and/or followed by a colon)

display $a "Verdi, Giuseppe, 1813-1901." as "Verdi, Giuseppe, (1813-1901)",
followed by a colon if $t is next

display the title $t "Otello." in italics, index it somewhere

extract the place "Milano" from $d "Milano: Ricordi, c1913" (the text before ":")

display copyright signs more nicely than "c" (applies to 787$d).
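
A few of these operations, sketched in Python on the 787 field from the example. The regular expressions are illustrative guesses at the patterns involved, not rules taken from AACR2 or MARC, and the dictionary lookup assumes that no subfield code repeats.

```python
import re

# The 787 field from the example, as (code, value) pairs.
field_787 = [
    ("i", "reproduction of (manifestation)"),
    ("a", "Verdi, Giuseppe, 1813-1901."),
    ("t", "Otello."),
    ("d", "Milano: Ricordi, c1913"),
]
sub = dict(field_787)  # fine here because no code repeats

# $i without the text in parentheses.
relation = re.sub(r"\s*\([^)]*\)", "", sub["i"])
# $a with the life dates moved into parentheses.
name = re.sub(r",\s*(\d{4}-(?:\d{4})?)\.?$", r", (\1)", sub["a"])
# The place is whatever precedes the first ":" in $d.
place = sub["d"].split(":", 1)[0].strip()
# "c1913" displayed with a proper copyright sign.
imprint = re.sub(r"\bc(\d{4})\b", r"©\1", sub["d"])

print(relation, name, place, imprint, sep=" | ")
```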


245$b is just a notorious example: the subfield not only combines several
concepts, as 787$d does, but also has no fixed first one. Its meaning
therefore has to be deduced from punctuation which unfortunately (but as
usual) is not in the subfield itself but immediately precedes it.
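
A minimal Python sketch of that deduction, assuming the field is given as (code, value) pairs with ISBD punctuation present. The function name and the sample fields are made up; "=" is taken to introduce a parallel title and ":" other title information.

```python
def classify_245_b(subfields):
    """Label each $b by the ISBD mark that ends the preceding subfield."""
    labels = []
    previous = ""
    for code, value in subfields:
        if code == "b":
            mark = previous.rstrip()[-1:]  # last non-space character, if any
            if mark == "=":
                labels.append(("parallel title", value))
            elif mark == ":":
                labels.append(("other title information", value))
            else:
                labels.append(("undetermined", value))
        previous = value
    return labels

# Made-up 245 fields that differ only in the mark before $b.
parallel = [("a", "Our mission ="), ("b", "Notre commande")]
subtitle = [("a", "Our mission :"), ("b", "a history")]
print(classify_245_b(parallel))
print(classify_245_b(subtitle))
```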

Furthermore the ensemble $a+$t+$d constitutes a unit (*one* citation) which
in many cases should not be torn apart.

[There's also the case of 100$c as a kind of unspecific container for
any of the several different classes of information to be injected into
the heading according to AACR2 or RDA: professions, bynames, indications
of rank etc. But there is no non-MARC markup except "," and it's almost
impossible to reverse engineer $c to the factual information ("Spanish
king") underlying the heading.]


 -- There may be data you want to get as a whole, which is spread over multiple
 subfields. This information cannot be described by a range of subfields,
 but only by the closure through punctuation. E.g. in the field
 
 245   00$aHeritage Books archives.$pUnderwood biographical 
 dictionary.$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne 
 Galeener-Moore.
 
 the data you want to get is
 
 Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2 
 revised [electronic resource]

I think 245 is one of the many cases where specific information can be
/deduced/ from (MARC and ISBD) markup in the field but it would be
dangerous to state that e.g. 245$h /contains data/. It is tempting to
speak or think in terms of subfield content, i.e. something data-like
which is implicitly terminated by the next subfield mark. (The " /"
actually does not belong to $h when attempting to view it as data; it's
just an indication that the next subfield mark to follow will probably be $c.)
Thus 245 is, in XML lingo, mixed content with most of the prescribed
punctuation /outside/ the child data elements. As usual, MARCspec too
cannot boldly declare that the permissible results should be regarded
as the text or as the data: both views are legitimate and have to be
taken into account.

Achieving the string you just gave is either trivial (prevalent AACR2 practice
with ISBD punctuation always provided in the record: fetch the field and
substitute each subfield mark by a single space) or involves much magic (coming
D-A-CH practice with ISBD punctuation generally not provided: fetch the field,
analyze the subfield marks and enhance it with proper ISBD punctuation).
[O.k., I see: you either stripped $c from it, or the content after "/", or
the specific constellation of the trailing "/" immediately preceding
$anything or specifically $c. ISBD knows about a parallel statement
of responsibility, as in "Our Mission / by Corporate Body A = Notre
Commande / par Corporation A", but I don't know offhand how this is coded
in AACR2+MARC for current examples.]
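
The "trivial" variant might look like this in Python. It is only a sketch: "$" stands in for the real subfield delimiter, the function name is made up, and the "&" in $n is an assumption about the quoted example.

```python
import re

# "$" stands for the subfield delimiter (0x1F in a real record).
# The "&" in $n is assumed; the quoted example may differ at that point.
raw_245 = ("$aHeritage Books archives.$pUnderwood biographical dictionary."
           "$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne Galeener-Moore.")

def flatten_until(raw: str, stop: str = " /") -> str:
    """Replace each subfield mark by a space, then cut at the first stop mark."""
    text = re.sub(r"\$[a-z0-9]", " ", raw)
    text = re.sub(r"\s+", " ", text).strip()
    head, _, _ = text.partition(stop)
    return head

print(flatten_until(raw_245))
```

This works only because the record carries its ISBD punctuation. With punctuation omitted there is no " /" to stop at, and the cut would have to be made on the subfield code instead.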

And, as I'm not a typewriter, I would rather process the content of
245 with the help of the semantic clues given by the MARC encoding. Something
with only ". " as remaining delimiters is not much help. (And retrieving
more refined components like $a, $b etc. afterwards and matching them to
specific parts of the combined string above seems to be very much work,
comparable to automatic tagging of OCR results.)


 Is this what you mean when you want to say something like "Get me all from
 field XXX until you hit Y"? I guess so.

As I understand the purpose of MARCspec it is kind of hit and run:

It is not a MOM (MARC Object 

RE: [librecat-dev] A common MARC record path language

2014-02-19 Thread Patrick Hochstenbach
].

If there is any other approach you can think of, please make a proposal or 
start a substantial discussion here. Otherwise I can't see any options for
solving this problem in MARCspec.

Cheers!

Carsten
___
Carsten Klee
Abt. Überregionale Bibliographische Dienste IIE
Staatsbibliothek zu Berlin - Preußischer Kulturbesitz

Fon:  +49 30 266-43 44 02

 -----Original Message-----
 From: Thomas Berger [mailto:t...@gymel.com]
 Sent: Wednesday, 19 February 2014 01:04
 To: Klee, Carsten; 'Patrick Hochstenbach'
 Cc: v...@gbv.de; librecat-...@mail.librecat.org; perl4lib@perl.org
 Subject: Re: [librecat-dev] A common MARC record path language




 On 18.02.2014 17:47, Klee, Carsten wrote:

  I understand that there is MARC data combined with cataloging rules. We
  don't use this approach within our MARC. So I'm not really aware of the
  problems involved.

 Your MARC, however, will be very much interested in "/" (or "=") as the
 first character of some subfield in 245, if I recall correctly. Not such a big
 difference, I would think. But maybe a slight complication of the matter,
 since MARCspec would have to cope with both approaches...

 Thomas Berger