Okay, again: give me the algorithm software should use to figure out
what title(s) to display to the user (assuming we don't just want
to put out the whole 245, ISBD punctuation and all).
How does software know when to take exactly what from a 240 or 740, and
when to use it as a title label? (A 240 is often useless as a title
label, e.g. "Selections", which is NOT in fact the title of the work it's
attached to, to any user at all. And a work cited in a 740 may or may not
actually be the work in the record at hand; it could also be a 'related'
work in some way.)
I have studied cataloging, rather extensively, because it's interesting,
and because I want to make my software use the bibliographic data as
well as possible. I took all the cataloging classes there were in
library school. I talk to catalogers regularly. I read cataloging
journals like CCQ, listservs like this one, and cataloging blogs.
And I've spent quite a bit of time trying to figure out how to get
things like this out of our actual AACR2/MARC. It is not for lack of
trying or experience or talking to catalogers that I conclude that many
things that catalogers DO spend very expensive expert time encoding in a
record in fact can NOT be reliably or simply extracted
algorithmically. Yeah, you can come up with very complicated rules that
will _mostly_ work (very complicated rules mean much more expensive to
implement), rules that need to be constantly tweaked and enhanced as new
examples are found that they don't work for (again, read 'expensive' in
staff time; writing software costs money in developer time). Sure.
But the original reason I started this sub-thread was to argue that THIS
is the real problem with our data, and one that MarcXML does not touch.
Now, if you believe that it is not possible or feasible to create data
describing bibliographic records that does NOT suffer from problems like
this, that bibliographic items are inherently so complicated that it is
not POSSIBLE to create data that is actually easily used by
developers... well, then we can just disagree on that. But non-library
developers are not going to be eager to use such data just because it's
in XML.
On 1/19/2011 1:28 PM, Weinheimer Jim wrote:
I guess I didn't make myself clear. When there are different titles listed in
the 245, there is *absolutely no reason whatsoever* why a computer would have
to extract those titles automatically since the cataloging rules make very
clear that they are supposed to be traced in separate 240, 246, 7xx$a$t and 740
fields (off the top of my head, probably not an exhaustive list). The parsing
has already been done and the computer work is superfluous. A cataloger knows
this, while a non-cataloger does not. There is no algorithm needed since the
cataloger has done all the work manually.
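In other words, the software side reduces to something like this sketch, assuming fields are parsed into (tag, {code: value}) pairs; the tag list is off the top of my head, not exhaustive:

```python
# Sketch of "the parsing has already been done": just collect the
# traced titles. The tag lists here are illustrative, not exhaustive.

TITLE_TRACING_TAGS = {"240", "246", "730", "740"}  # title in $a
NAME_TITLE_TAGS = {"700", "710", "711"}            # title in $t

def traced_titles(fields):
    """Return every title the cataloger has traced in the record."""
    titles = []
    for tag, sf in fields:
        if tag in TITLE_TRACING_TAGS and sf.get("a"):
            titles.append(sf["a"])
        elif tag in NAME_TITLE_TAGS and sf.get("t"):
            titles.append(sf["t"])
    return titles
```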
In the example in the article, I found the record the author mentioned and showed how the
cataloger had "parsed" it all manually.
Where is the problem with this? Why parse something that doesn't need it? Why
not use the power of the entirety of the records? It seems from your post
that you are maintaining that a title in a 740, 246, or 700$a$t field
is not usable?
James L. Weinheimer [email protected]
Director of Library and Information Services
The American University of Rome
Rome, Italy
First Thus: http://catalogingmatters.blogspot.com/
________________________________________
From: Jonathan Rochkind [[email protected]]
Sent: Wednesday, January 19, 2011 7:03 PM
To: Resource Description and Access / Resource Description and Access
Cc: Weinheimer Jim
Subject: Re: [RDA-L] Linked data
Again, as someone who "knows cataloging rules", if there's an algorithm
you can give me that will let me extract the individual elements (actual
transcribed title vs analytical titles vs parallel titles vs statement
of responsibility) reliably from correct AACR2 MARC, please let me know
what it is.
I am fairly certain there is no such algorithm that is reliable.
I guess you could say that there's no reason to _expect_ that you should
be able to get those elements out of a data record. But most
developers, library or not, will consider bibliographic data from which
you can't reliably extract the title of the item (a pretty basic
attribute, just about the most basic attribute there is) to be pretty
low-value data. They won't change their opinion if you show them the
record serialized in MarcXML instead of ISO Marc21.
All that being an expert in the data gets you is the knowledge that
you _can't_ reliably, algorithmically extract the transcribed title
alone from an arbitrary 245 in MARC/AACR2. It'll work for the
basic cases, but once you start putting in parallel titles, analytics,
and parallel titles of analytical titles, it's a big mess. And such
complicated cases (rare in general, but common in some domains
like music records) are also the ones where the cataloger is most likely
to have gotten the punctuation not EXACTLY right, making it even more
hopeless, even if the programmer did want to write an incredibly
complicated algorithm that tried to take into account the combination of
ISBD punctuation with MARC subfields.
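To make the failure mode concrete, here is roughly what the "simple" version of that parser looks like. The sample 245 data is invented; I'm only claiming the shape of the problem, not these exact records:

```python
# Roughly the "simple" 245 parser: join $a and $b, then split on the
# ISBD delimiters (' = ' parallel title, ' ; ' next work). The sample
# data below is invented to show the shape of the problem.

ISBD_TRAILING = " /:;=,."

def naive_titles(sub_a, sub_b=""):
    """Split a 245 $a/$b pair into title strings using ISBD punctuation."""
    text = " ".join(p for p in (sub_a, sub_b) if p)
    first_language = text.split(" = ")[0]  # drop parallel titles
    works = first_language.split(" ; ")    # separate analytic titles
    return [w.strip(ISBD_TRAILING) for w in works]

# Fine for the easy case:
#   naive_titles("Cartas del Caribe =", "Letters from the Caribbean /")
#   -> ["Cartas del Caribe"]
# But with parallel titles *of* analytics ("Title A = Titre A ; Title B
# = Titre B"), the first ' = ' split throws away Title B entirely, and
# no flat split on punctuation can recover the nesting.
```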
Yes, "many of these issues have been known from the beginning and dealt
with in various ways." That doesn't make the data easily usable by
developers, whether you put it in MarcXML or not. Those "various ways", if
we're talking about software trying to extract elements from bib
records, are hacks: expensive (in developer time) and fragile (they
still won't work all the time).
On 1/19/2011 12:52 PM, Weinheimer Jim wrote:
Jonathan Rochkind wrote:
Concerning: "> "One example of this can be found reported in this article:
http://journal.code4lib.org/articles/3832"
<snip>
Okay, what would someone who "knows library metadata" do to get a
displayable title out of records in an arbitrary corpus of MARC data?
There's an easy answer that only those who know library metadata
(apparently unlike people like Thomale or me who have been working with
it for years) can provide? I have my doubts.
</snip>
I agree that this is an excellent article that everyone should read, but I
wrote a comment myself there (no. 7) discussing how this article illustrates
how important it is to know cataloging rules and/or to work closely with
experienced catalogers when building something like this. It also shows how
many programmers concentrate on certain parts of a record and tend to ignore
the overall view, while catalogers concentrate on whole records.
In this case, the parsing is *always* done manually by the cataloger, who is
directed to make title added entries, along with uniform titles, including the
authors--that is, so long as the cataloger is competent and following the
rules. So, it is always a mistake to concentrate only on a single field, since a
record must be considered in its entirety. It would be unrealistic for
systems people to know these intricacies, but it just shows how important it is
that they work closely with catalogers.
Therefore, it's not *necessarily* arbitrary. Many of these issues have been
known since the very beginning and have been dealt with in various ways.