Re: [librecat-dev] A common MARC record path language

Thomas Berger Mon, 24 Feb 2014 02:20:16 -0800

Carsten,


> Thank you both for bringing the discussion forward. I must admit that I'm
> having some problems following here. I read your mails multiple times, really
> trying to understand your demands. After reading this [1], I hope I'm getting
> closer.

You could also have a look at Jason Thomale's "Interpreting MARC: Where's the
Bibliographic Data?" < http://journal.code4lib.org/articles/3832 > (preceding
Karen Coyle's more widely known article "MARC21 as Data: A Start"
< http://journal.code4lib.org/articles/5468 >).


> I just want to sum up what I think I've understood so far. Please correct me 
> if I'm wrong..
> 
> -- When it comes to cataloging based delimiters (punctuation), there is some
> inner semantic to the content of the subfields. E.g. "=$b" in field 245 means
> something different than ":$b".

Yes and no: In your example

787  08 $ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901.
$tOtello.$d Milano: Ricordi, c1913

three of the four subfields have internal structure which is likely to
be exploited as in

display "$ireproduction of (manifestation)" without the text in parentheses
as the left column in a table or styled differently (introductory phrase
in italics and/or followed by a colon)

display "$aVerdi, Giuseppe, 1813-1901." as "Verdi, Giuseppe, (1813-1901)"
followed by a colon if $t is next

display the title $tOtello. in italics, index it somewhere

extract the place "Milano" from "$dMilano: Ricordi, c1913" before ":"

display copyright signs more nicely than "c" (applies to 787$d).
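
A minimal sketch of some of these exploits, assuming the field arrives as a
plain string with "$" as the subfield mark. The helper name split_subfields
and the regular expressions are illustrative assumptions, not part of MARCspec
or of any library:

```python
import re

def split_subfields(field_body):
    """Split '$aFoo$bBar' into [('a', 'Foo'), ('b', 'Bar')], keeping field order."""
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(r"\$(\w)([^$]*)", field_body)]

field_787 = ("$ireproduction of (manifestation) $aVerdi, Giuseppe, 1813-1901."
             "$tOtello.$d Milano: Ricordi, c1913")

# A dict is only safe here because no subfield code repeats in this field.
subfields = dict(split_subfields(field_787))

# $i: drop the parenthesised qualifier for display.
relation = re.sub(r"\s*\([^)]*\)", "", subfields["i"]).strip()

# $d: the place is whatever precedes the first colon.
place = subfields["d"].split(":", 1)[0].strip()

# $d: show a copyright sign instead of a "c" directly in front of a year.
date_display = re.sub(r"\bc(?=\d{4}\b)", "\u00a9", subfields["d"])

print(relation)
print(place)
print(date_display)
```

Each of these rules relies on cataloguing conventions inside the subfield, not
on anything the MARC encoding itself guarantees.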


245$b is just a notorious example: the subfield not only combines
several concepts, as 787$d does, but also has no fixed "first one".
Its meaning therefore has to be deduced from punctuation which is
unfortunately (but as usual) not "in" the subfield itself but immediately
preceding it.

Furthermore the ensemble $a+$t+$d constitutes a unit (*one* citation) which
in many cases should not be torn apart.

[There's also the case of 100$c as a kind of unspecific container for
any of the several different classes of information to be injected into
the heading according to AACR2 or RDA: professions, bynames, indications
of rank etc. But there is no non-MARC markup except "," and it's almost
impossible to reverse-engineer $c to the factual information ("Spanish
king") underlying the heading]


> -- There may be data you want to get as a whole, which is spread over multiple
> subfields. This information cannot be described by the range of subfields,
> but with the closure through punctuation. E.g. in the field
> 
> 245   00$aHeritage Books archives.$pUnderwood biographical 
> dictionary.$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne 
> Galeener-Moore.
> 
> the data you want to get is
> 
> Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2 
> revised [electronic resource]

I think 245 is one of the many cases where specific information can be
/deduced/ from (MARC and ISBD) markup in the field, but it would be
dangerous to state that e.g. 245$h /contains data/. It is tempting to
speak or think in terms of "subfield content", i.e. something data-like
which is implicitly terminated by the next subfield mark. But the " /"
actually does not belong to $h when attempting to view it as data (it's
just an indication that the next subfield mark to follow will probably
be $c). Thus 245 is, in XML lingo, "mixed content" with most of the
prescribed punctuation /outside/ the child data elements. As usual,
MARCspec too cannot boldly declare that the permissible results should
be regarded as "the text" or "the data" - both views are legitimate and
have to be taken into account.
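
To illustrate the "mixed content" view, here is a rough sketch that separates
each subfield into data and the trailing ISBD punctuation announcing the next
element. The function name split_mixed and the set of punctuation marks are
assumptions made for this example. The full stop is deliberately left alone
because it may just as well belong to the data (abbreviations, initials):

```python
import re

# Trailing marks treated as "markup" in this sketch; "." is excluded on purpose.
ISBD_TAIL = re.compile(r"\s*[/:;=,]\s*$")

def split_mixed(field_body):
    """Return (code, data, trailing_punctuation) triples in field order."""
    parts = []
    for m in re.finditer(r"\$(\w)([^$]*)", field_body):
        code, text = m.group(1), m.group(2)
        tail = ISBD_TAIL.search(text)
        if tail:
            parts.append((code, text[:tail.start()], tail.group().strip()))
        else:
            parts.append((code, text, ""))
    return parts

field_245 = ("$aHeritage Books archives.$pUnderwood biographical dictionary."
             "$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne "
             "Galeener-Moore.")

for code, data, punctuation in split_mixed(field_245):
    print(code, repr(data), repr(punctuation))
```

With this split the " /" is reported next to $h but not as part of its data,
so a caller can pick either the "text" view or the "data" view.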

Achieving the string you just gave is either trivial (prevalent AACR2 practice
with ISBD punctuation always provided in the record: "Fetch the field and
substitute $.? by a single space") or involves much magic (coming D-A-CH
practice with ISBD punctuation generally not provided: "Fetch the field,
analyze the subfield marks and enhance it with proper ISBD punctuation").
[o.k. I see: You stripped either $c from it, or the content after "/", or
the specific constellation of the trailing "/" immediately preceding
$anything or specifically $c - ISBD knows about a "parallel statement
of responsibility" as in "Our Mission / by Corporate Body A = Notre
Commande / par Corporation A", but I don't know offhand how this is coded
in AACR2+MARC for current examples]
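
The "trivial" variant could look roughly like this, assuming ISBD punctuation
is present in the record. Reading "$.?" as "subfield mark plus one code
character" and cutting at a " /" that directly precedes a subfield mark are
both my interpretations for this sketch. The function name is made up:

```python
import re

field_245 = ("$aHeritage Books archives.$pUnderwood biographical dictionary."
             "$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne "
             "Galeener-Moore.")

def title_without_responsibility(field_body):
    # Assumption: "/" directly before a subfield mark opens the statement
    # of responsibility, so everything from there on is dropped.
    head = re.split(r"\s*/\s*(?=\$)", field_body, maxsplit=1)[0]
    # Replace each subfield mark ("$" plus one code character) by one space.
    return re.sub(r"\$\w", " ", head).strip()

print(title_without_responsibility(field_245))
```

This does nothing sensible for records without ISBD punctuation, and it
would cut too early for a parallel statement of responsibility.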

And - as I'm not a typewriter - I would rather process the content of
245 with the help of the semantic clues given by the MARC encoding. Something
with only ". " as remaining delimiters is not much help. (And retrieving
more refined components like $a, $b etc. afterwards and matching them to
specific parts of the combined string above seems to be very much work -
comparable to automatic tagging of OCR results)


> Is this what you mean when you want to say something like "Get me all from
> field XXX until you hit Y"? I guess so.

As I understand the purpose of MARCspec it is kind of "hit and run":

It is not a MOM (MARC Object Model) or rather an object model for
any format derived from ISO 2709 and its concepts of files, records,
(flavors of) fields and subfields and therefore no abstract API
can be specified (prescribing that some operation X is defined on
record objects and yields field objects).

MARCspecs can only be applied to records and yield an implementation-
dependent something; /preferably/ this something should be a list of
some other things.

To be more specific: There may be implementations which indeed return a single
string
"Heritage Books archives. Underwood biographical dictionary. Volumes 1 & 2
revised [electronic resource]"
but I would consider these to be very special.

Other implementations might return a /string/

$aHeritage Books archives.$pUnderwood biographical dictionary.$nVolumes 1 & 2
revised$h[electronic resource] /

or a string

$aHeritage Books archives$pUnderwood biographical dictionary$nVolumes 1 & 2
revised$helectronic resource

and a third implementation could produce a list like

$aHeritage Books archives.
$pUnderwood biographical dictionary.
$nVolumes 1 & 2 revised
$h[electronic resource] /


and maybe a fourth implementation based on MARCXML and implemented within
the XML DOM would yield an (unserialized) XML fragment.
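
The first three result shapes can be mimicked with a few lines, just to show
that they are different views of one selection ("everything in 245 before
$c"). All function names are invented for this sketch, and the punctuation
stripping in as_bare_string is deliberately crude (it would also eat the full
stop of an abbreviation):

```python
import re

FIELD_245 = ("$aHeritage Books archives.$pUnderwood biographical dictionary."
             "$nVolumes 1 & 2 revised$h[electronic resource] /$cLaverne "
             "Galeener-Moore.")

def subfields_until(field_body, stop_code):
    """Return (code, raw_text) pairs in field order, stopping before stop_code."""
    result = []
    for match in re.finditer(r"\$(\w)([^$]*)", field_body):
        code, text = match.group(1), match.group(2)
        if code == stop_code:
            break
        result.append((code, text))
    return result

def as_list(pairs):
    """One string per subfield, subfield mark and punctuation kept."""
    return ["$" + code + text for code, text in pairs]

def as_marked_string(pairs):
    """A single string, subfield marks and punctuation kept."""
    return "".join(as_list(pairs))

def as_bare_string(pairs):
    """A single string with trailing punctuation and square brackets removed."""
    cleaned = []
    for code, text in pairs:
        text = text.strip().rstrip("/:;=,.").strip()
        cleaned.append("$" + code + text.strip("[]"))
    return "".join(cleaned)

pairs = subfields_until(FIELD_245, "c")
print(as_marked_string(pairs))
print(as_bare_string(pairs))
print(as_list(pairs))
```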



> -- Therefore the order of subfields is crucial. While MARCspec allows
> subfields stated in any order, a result should preserve the subfield order
> emerging in the field.

For "extract me /this/ subfield" there is no difference.

For "extract me those 15 subfields which might occur and I'm going to
name now" I'm not so sure: The cases above were to my impression more
like "partition me field 245 at some interesting position I provide"
or "give me anything from 783 except $i"

Furthermore the tasks of selection and extraction might (I'm speculating)
sometimes involve different (sets of) subfield tags: Select me those
6XX with either no $5 or a $5 "for me" and extract the "proper content"
(i.e. everything but control subfields but including $3?).
Or: Select any 651 with $2rswk [there is a nexus to i2=7 which might
be disregarded?] and give me $a and either $e or $4 from the results.
(Some implementations would return a list of lists? As with any
selection, the primary list would be one where the members correspond
to the fields matched)
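
A small sketch of selection and extraction using different sets of subfield
codes and returning a list of lists. The in-memory record layout, the function
names and the sample values are all invented for illustration:

```python
# Hypothetical record layout: (tag, indicators, [(code, value), ...]).
# The field contents below are made up for this example.
record = [
    ("651", " 7", [("a", "Berlin"), ("2", "rswk")]),
    ("651", " 0", [("a", "Berlin (Germany)")]),
    ("651", " 7", [("a", "Wien"), ("4", "prp"), ("2", "rswk")]),
]

def select_fields(record, tag, required):
    """Fields with the given tag that contain every (code, value) in required."""
    return [field for field in record
            if field[0] == tag and all(pair in field[2] for pair in required)]

def extract(field, codes):
    """(code, value) pairs whose code is in codes, in field order."""
    return [(code, value) for code, value in field[2] if code in codes]

def select_and_extract(record, tag, required, primary, alternatives):
    """One inner list per matched field: primary codes plus the first
    alternative code that is actually present ("either $e or $4")."""
    results = []
    for field in select_fields(record, tag, required):
        wanted = set(primary)
        present = {code for code, _ in field[2]}
        for alternative in alternatives:
            if alternative in present:
                wanted.add(alternative)
                break
        results.append(extract(field, wanted))
    return results

print(select_and_extract(record, "651", [("2", "rswk")], ["a"], ["e", "4"]))
```

The outer list has one member per matched field, so the caller can still tell
which $a belongs to which $4.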



> -- Some fields are linked through specific subfields. There may be some data
> you want to get dependent on linkage from other fields. I'm not sure if I have
> an example for this. Maybe you could provide one.


cf. "Appendix A - Control Subfields" of the MARC21 documentation at
< http://www.loc.gov/marc/bibliographic/ecbdcntf.html >: I was especially
alluding to $6 and $8, which provide two MARC21-specific ways of denoting
that data in different fields complements each other: giving the same
information in different scripts or codes, or forming a kind of "tabular
data" (prescribing an order for the fields). Although it is common practice,
there is neither a rule that MARC records consist of fields sorted by label
nor that the order of fields in the record matters (i.e. transports
information). It's just that display along increasing field labels gives a
close approximation to ISBD ordering.

I have no information on how often these subfields occur (AFAIK original
script cataloguing has to utilize $6), but stumbling upon some field with $6
and having to retrieve the associated content is a higher-order operation
that /could/ be at least facilitated by MARCspecs if not done automagically.
An example given is

245     10$6880-03$aSosei to kako :$bNihon Sosei Kako Gakkai shi.
880     10$6245-03/$1$a[Title in Japanese script] :$b[Subtitle in Japanese 
script] .

and $6 contains the three-digit field number of the associated field and
a two-digit "random number" to make this unique (there may be several
880's, each associated with at most one non-880 field). Therefore upon
seeing the 245 with $6 content "880-03" the task is:
Retrieve the (only) 880 which has a $6 whose content starts with exactly
"245-03".
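
As a sketch, that lookup could be written as follows. The record layout and
the function name linked_fields are assumptions for this example, and the
second 880 is an invented distractor:

```python
import re

# Hypothetical record layout: (tag, indicators, [(code, value), ...]).
record = [
    ("245", "10", [("6", "880-03"), ("a", "Sosei to kako :"),
                   ("b", "Nihon Sosei Kako Gakkai shi.")]),
    ("880", "10", [("6", "245-03/$1"), ("a", "[Title in Japanese script] :"),
                   ("b", "[Subtitle in Japanese script].")]),
    # Invented second 880, only here to show that it is not picked up.
    ("880", "  ", [("6", "250-04/$1"), ("a", "[Edition in Japanese script]")]),
]

# $6 starts with "<linked tag>-<two-digit number>"; anything after is ignored.
LINK = re.compile(r"^(\d{3})-(\d{2})")

def linked_fields(record, field):
    """Return the fields whose $6 points back at the given field."""
    links = [value for code, value in field[2] if code == "6"]
    if not links:
        return []
    match = LINK.match(links[0])
    if not match:
        return []
    target_tag, number = match.groups()
    back_reference = field[0] + "-" + number  # e.g. "245-03"
    return [candidate for candidate in record
            if candidate[0] == target_tag
            and any(code == "6" and value.startswith(back_reference)
                    for code, value in candidate[2])]

print(linked_fields(record, record[0]))
```

The same function works in the other direction: starting from the 880 it
finds the 245, because the 245 carries "880-03" in its own $6.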


Thomas
