Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Bill Dueber
2011/4/8 Karen Miller 

> I hope I'm not pointing out the obvious,


That made me laugh so hard I almost ruptured something.

Thank you so much for such a complete (please, god, tell me it's
complete...) explanation. It's a little depressing, but at least now I know
why I'm depressed :-)


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Karen Miller
OK, as a cataloger who has been confused by the jurisdictional/place name
distinction, I'm going to jump in here. 

Whether "England" means the free-floating geographic entity or the country
is not quite unknowable -- it depends on the MARC codes that accompany it. 

The brief answer is this: a name used in a 651 $a or in any $z should match
a 151 in the LC authorities.

If the MARC field is 151 or 651 (let's just say x51), then the $a should
match a 151 in the authority file.
MARC subfield z ($z) is always a geographic subdivision and should match a
151.

Here's where it gets tricky: 
If the MARC field is a x10 (110, 610, 710 – corporate bodies), then the $a
should match a 110 or a 151 in the authority file. If the first indicator of
such a MARC field is a 1, then it will probably match a 151 – first
indicator "1" means that a heading is jurisdictional and may match a 151.

For  example:

110 1_ United States. ‡b Dept. of Agriculture

There is a 

151 United States 

in the LC authorities, but no 

110 United States

yet it can be used as a corporate body name in a bib. record with a 110
field. 
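
The rules of thumb above can be written down as a small lookup. This is only
a sketch of what this message describes, not an official MARC validation
rule: the function name and the ordering of the returned tags ("most likely
first") are assumptions made for illustration.

```python
from typing import List


def expected_authority_tags(tag: str, subfield: str, ind1: str = " ") -> List[str]:
    """Return the authority tags a heading part should match, most likely first.

    tag      -- bibliographic or authority field tag, e.g. "651" or "110"
    subfield -- subfield code without the "$", e.g. "a" or "z"
    ind1     -- first indicator of the field
    """
    if subfield == "z":
        # $z is always a geographic subdivision.
        return ["151"]
    if subfield == "a" and tag.endswith("51"):
        # x51 fields (151, 651): the $a is a geographic name.
        return ["151"]
    if subfield == "a" and tag.endswith("10"):
        # x10 fields (110, 610, 710): a corporate body, but a jurisdictional
        # heading (first indicator "1") will probably be established as a 151.
        return ["151", "110"] if ind1 == "1" else ["110", "151"]
    # Anything else is outside the rules described in this message.
    return []


# "110 1_ United States. $b Dept. of Agriculture": look for a 151 first.
print(expected_authority_tags("110", "a", "1"))  # ['151', '110']
```

A matcher would try the returned tags in order and stop at the first hit.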

This is further confused by the VIAF, in which some national libraries have
established the United States as a corporate body (110).

At the risk of confusing things, I'd suggest looking at countries like the
United States, Kenya or Canada as examples. England is not a great example
because it's not a current jurisdiction name - there is a note in the LC
authority record that reads "Heading for England valid as a jurisdiction
before 1536 only. Use "(England)" as qualifier for places (23.4D) and for
nongovernment bodies (24.4C2)." It is established as a 110 because it *used
to be* a jurisdiction name and would be valid for works issued by the
government prior to 1536. Obviously this note is of no use to a machine, but
it explains why we aren't seeing it used as a jurisdiction (a corporate
body) with subordinate bodies.

I hope I'm not pointing out the obvious, but the use of names that appear in
151 fields in the authority file as 110 fields in bibliographic records
confused me for a very long time; our authorities librarian explained it to
me at least twice before the proverbial light bulb went on for me. 

Karen

Karen D. Miller
Monographic/Digital Projects Cataloger
Bibliographic Services Dept.
Northwestern University Library
Evanston, IL 
k-mill...@northwestern.edu
847-467-3462


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Bill
Dueber
Sent: Friday, April 08, 2011 1:40 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] LCSH and Linked Data

On Fri, Apr 8, 2011 at 1:50 PM, Shirley Lincicum  wrote:

> Ross is essentially correct. Education is an authorized subject term
> that can be subdivided geographically. Finance is a free-floating
> subdivision that is authorized for use under subject terms that
> conform to parameters given in the scope notes in its authority record
> (680 fields), but it cannot be subdivided geographically. England is
> an authorized geographic subject term that can be added to any heading
> that can be subdivided geographically.


Wait, so is it possible to know if "England" means the free-floating
geographic entity or the country? Or is that just plain unknowable.

Suddenly, my mouth is hungering for something gun-flavored.

I know OCLC did some work trying to dis-integrate different types of terms
with the FAST stuff, but it's not clear to me how I can leverage that (or
anything else) to make LCSH at all useful as a search target or (even
better) facet.  Has anyone done anything with it?


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Bill Dueber
On Fri, Apr 8, 2011 at 1:50 PM, Shirley Lincicum  wrote:

> Ross is essentially correct. Education is an authorized subject term
> that can be subdivided geographically. Finance is a free-floating
> subdivision that is authorized for use under subject terms that
> conform to parameters given in the scope notes in its authority record
> (680 fields), but it cannot be subdivided geographically. England is
> an authorized geographic subject term that can be added to any heading
> that can be subdivided geographically.


Wait, so is it possible to know if "England" means the free-floating
geographic entity or the country? Or is that just plain unknowable.

Suddenly, my mouth is hungering for something gun-flavored.

I know OCLC did some work trying to dis-integrate different types of terms
with the FAST stuff, but it's not clear to me how I can leverage that (or
anything else) to make LCSH at all useful as a search target or (even
better) facet.  Has anyone done anything with it?


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Shirley Lincicum
I'm a cataloger who has been following this discussion with interest,
but not necessarily understanding all of it. I'll try to add what I
can regarding the rules for constructing LCSH headings.

> My understanding is that Education--England--Finance *is* authorized,
> because Education--Finance is and England is a free-floating
> geographic subdivision.  Because it's also an authorized heading,
> "Education--England--Finance" is, in fact, an authority.  The problem
> is that free-floating subdivisions cause an almost infinite number of
> permutations, so there aren't LCCNs issued for them.

Ross is essentially correct. Education is an authorized subject term
that can be subdivided geographically. Finance is a free-floating
subdivision that is authorized for use under subject terms that
conform to parameters given in the scope notes in its authority record
(680 fields), but it cannot be subdivided geographically. England is
an authorized geographic subject term that can be added to any heading
that can be subdivided geographically. Thus, Education -- England --
Finance is a valid LCSH heading, whereas Education -- Finance --
England would not be. This is wonky, and it's stuff like this that
makes LCSH so unwieldy and difficult to validate, even for humans who
actually have the capacity to learn and adjust to all of the various
inconsistencies.
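
To make the ordering rule concrete, here is a minimal sketch that checks only
the one constraint described above: a place name may follow a term only if
that term can be subdivided geographically. The lookup table is hand-made for
the two terms in this example; in practice that information would have to
come from the authority records, and real LCSH validation involves far more
than this.

```python
from typing import List, Tuple

# Assumption: a hand-made table saying whether a term may be followed by a
# geographic subdivision. Only the two terms discussed here are listed.
MAY_SUBDIVIDE_GEOGRAPHICALLY = {"Education": True, "Finance": False}

# (kind, term); kind is "topic", "subdivision" or "geographic"
Element = Tuple[str, str]


def geographic_order_ok(heading: List[Element]) -> bool:
    """Return False if a place name follows a term that cannot take one."""
    previous_term = None  # nearest preceding non-geographic term
    for kind, term in heading:
        if kind != "geographic":
            previous_term = term
        elif previous_term is not None:
            if not MAY_SUBDIVIDE_GEOGRAPHICALLY.get(previous_term, False):
                return False
    return True


valid = [("topic", "Education"), ("geographic", "England"), ("subdivision", "Finance")]
invalid = [("topic", "Education"), ("subdivision", "Finance"), ("geographic", "England")]
print(geographic_order_ok(valid))    # True
print(geographic_order_ok(invalid))  # False
```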

I don't know how relevant it is to this particular discussion, but
going forward I'm not sure how important it is to validate LCSH
headings. I really appreciate developers who seek to preserve the
semantic relationships present in the headings as much as possible; I
believe many of them have value. But aren't there ways to
preserve/extract that value without getting too bogged down in the
inconsistent left-to-right structure of the existing headings?

I hope this helps, at least a little bit. I'd be happy to answer
additional questions.

Shirley

Shirley Lincicum
Frustrated Cataloger


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Owen Stephens
Thanks Ross - I have been pushing some cataloguing folk to comment on some
of this as well (and have some feedback) - but I take the point that wider
consultation via autocat could be a good idea. (for some reason this makes
me slightly nervous!)

In terms of whether Education--England--Finance is authorised or not - I
think I took from Andy's response that it wasn't, but also looking at it on
authorities.loc.gov it isn't marked as 'authorised'. Anyway - the relevant
thing for me at this stage is that I won't find a match via id.loc.gov - so
I can't get a URI for it anyway.

There are clearly quite a few issues with interacting with LCSH as Linked
Data at the moment - I'm not that keen on how this currently works, and my
reaction to the MADS/RDF ontology is similar to that of Bruce D'Arcus (see
http://metadata.posterous.com/lcs-madsrdf-ontology-and-the-future-of-the-se),
but on the other hand I want to embrace the opportunity to start joining some
stuff up and seeing what happens :)

Owen

On Fri, Apr 8, 2011 at 3:10 PM, Ross Singer  wrote:

> On Fri, Apr 8, 2011 at 5:02 AM, Owen Stephens  wrote:
>
> > Then obviously I lose the context of the full heading - so I also want to
> > look for
> > Education--England--Finance (which I won't find on id.loc.gov as not
> > authorised)
> >
> > At this point I could stop, but my feeling is that it is useful to also
> look
> > for other combinations of the terms:
> >
> > Education--England (not authorised)
> > Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008
> )
> >
> > My theory is that as long as I stick to combinations that start with a
> > topical term I'm not going to make startlingly inaccurate statements?
>
> I would definitely ask this question somewhere other than Code4lib
> (autocat, maybe?), since I think the answer is more complicated than
> this (although they could validate/invalidate your assumption about
> whether or not this approach would get you "close enough").
>
> My understanding is that Education--England--Finance *is* authorized,
> because Education--Finance is and England is a free-floating
> geographic subdivision.  Because it's also an authorized heading,
> "Education--England--Finance" is, in fact, an authority.  The problem
> is that free-floating subdivisions cause an almost infinite number of
> permutations, so there aren't LCCNs issued for them.
>
> This is where things get super-wonky.  It's also the reason I
> initially created lcsubjects.org, specifically to give these (and,
> ideally, locally controlled subject headings) a publishing
> platform/centralized repository, but it quickly grew to be more than
> "just a side project".  There were issues of how the data would be
> constructed (esp. since, at the time, I had no access to the NAF), how
> to reconcile changes, provenance, etc.  Add to the fact that 2 years
> ago, there wasn't much linked library data going on, it was really
> hard to justify the effort.
>
> But, yeah, it would be worth running your ideas by a few catalogers to
> see what they think.
>
> -Ross.
>



-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Bill Dueber
On Fri, Apr 8, 2011 at 10:10 AM, Ross Singer  wrote:

> But, yeah, it would be worth running your ideas by a few catalogers to
> see what they think.
>


And if anyone does this...please please *please* write it up!

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


[CODE4LIB] LCSH and Linked Data / Ross

2011-04-08 Thread Ya'aqov Ziso
Hi and thank you Ross, Jonathan, and Andy,

I do wish someone from LC would answer Jonathan's questions for all codes
and geographic subdivision or subject implications. There's so much
self-inflicted pain I can go through trying to revive my cataloging days.
Here are some clarifications though:

List of Geographic Areas is the macro list, whereby List of countries
includes only countries as a subset from the macro list.

MARC Code List for Countries [choice of a MARC code is generally related to
information in field 260 (Publication, Distribution, etc. (Imprint)).  The
code recorded in 008/15-17 is used in conjunction with field 044 (Country of
Producer Code) when more than one code is appropriate to an item.]

MARC Geographic Area Codes are codes entered (according to geographic names
in the 6xx fields) in field 043.

The Country Codes and Geographic Area Codes are entered bureaucratically,
bypassing Jonathan's refined distinctions. These tasks are outsourced to
agencies separate from the catalogers assigning LCSH.

Now it starts getting uglier, since upkeep for these lists differs in time
and agency. Possibly new territory names are done now by NATO ... You would
expect to see the same name in a code list and in a geographic name (151).
Sometimes you won't. Sometimes you'll see redundancies which confuse even
more.

So since:

   1. LCSH has mistakes, inconsistencies
   2. LC doesn't talk to CODE4LIB to answer our questions
   3. OCLC will not talk to LC on our behalf

we can create the geographic name list(s) we need. Since we know that 6xx
forms for geographic names appear in 151 and 781 fields, we can create an
index for those names for matching to 6xx in LCSH. Andrew, please
complete/comment-on this list.

Ya'aqov
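
The index proposed in this message (names from 151 and 781 fields, used to
match 6xx headings) could be sketched roughly as below. The record layout (a
dict with "lccn", "151" and "781" keys), the normalization rule, and the
sample record are assumptions made for illustration; they are not a real
authority-file export format.

```python
from typing import Dict, Iterable, List, Set


def normalize(name: str) -> str:
    """Lower-case, collapse whitespace and drop trailing full stops."""
    return " ".join(name.casefold().split()).rstrip(".")


def build_geo_index(records: Iterable[Dict]) -> Dict[str, Set[str]]:
    """Map each normalized 151/781 name to the LCCNs of records carrying it."""
    index: Dict[str, Set[str]] = {}
    for record in records:
        for tag in ("151", "781"):
            for name in record.get(tag, []):
                index.setdefault(normalize(name), set()).add(record["lccn"])
    return index


# Made-up sample record; only the LCCN is taken from this thread.
sample: List[Dict] = [{"lccn": "n82068148", "151": ["England"], "781": ["England"]}]
geo_index = build_geo_index(sample)

# A name taken from a 6xx field in a bib record can then be looked up:
print(geo_index.get(normalize("England."), set()))  # {'n82068148'}
```

A set is kept per name because, as noted above, the same name can turn up in
more than one record.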


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Ross Singer
On Fri, Apr 8, 2011 at 5:02 AM, Owen Stephens  wrote:

> Then obviously I lose the context of the full heading - so I also want to
> look for
> Education--England--Finance (which I won't find on id.loc.gov as not
> authorised)
>
> At this point I could stop, but my feeling is that it is useful to also look
> for other combinations of the terms:
>
> Education--England (not authorised)
> Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008)
>
> My theory is that as long as I stick to combinations that start with a
> topical term I'm not going to make startlingly inaccurate statements?

I would definitely ask this question somewhere other than Code4lib
(autocat, maybe?), since I think the answer is more complicated than
this (although they could validate/invalidate your assumption about
whether or not this approach would get you "close enough").

My understanding is that Education--England--Finance *is* authorized,
because Education--Finance is and England is a free-floating
geographic subdivision.  Because it's also an authorized heading,
"Education--England--Finance" is, in fact, an authority.  The problem
is that free-floating subdivisions cause an almost infinite number of
permutations, so there aren't LCCNs issued for them.

This is where things get super-wonky.  It's also the reason I
initially created lcsubjects.org, specifically to give these (and,
ideally, locally controlled subject headings) a publishing
platform/centralized repository, but it quickly grew to be more than
"just a side project".  There were issues of how the data would be
constructed (esp. since, at the time, I had no access to the NAF), how
to reconcile changes, provenance, etc.  Add to the fact that 2 years
ago, there wasn't much linked library data going on, it was really
hard to justify the effort.

But, yeah, it would be worth running your ideas by a few catalogers to
see what they think.

-Ross.


Re: [CODE4LIB] MARC magic for file

2011-04-08 Thread Sean Hannan
http://i.imgur.com/6WtA0.png

(Sorry, it's Friday. Also, blame dchud for the idea.)

-Sean


On 4/6/11 4:53 PM, "Mike Taylor"  wrote:

> On 6 April 2011 19:53, Jonathan Rochkind  wrote:
>> On 4/6/2011 2:43 PM, William Denton wrote:
>>> 
>>> "Validity" does mean something definite ... but Postel's Law is a good
>>> guideline, especially with the swamp of bad MARC, old MARC, alternate
>>> MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
>>> of file and its magic---we can identify technically invalid but still
>>> usable MARC, that's good.
>> 
>> Hmm, except in the case of Web Browsers, I think general consensus is
>> Postel's law was not helpful. These days, most people seem to think that
>> having different browsers be tolerant of invalid data in different ways was
>> actually harmful rather than helpful to inter-operability (which is
>> theoretically the goal of Postel's law), and that's not what people do
>> anymore in web browser land, at least not to the extremes they used to do
>> it.
> 
> But the idea that browsers should be less permissive in what they
> accept is a modern one that we now have the luxury of only because
> adherence to Postel's law in the early days of the Web allowed it to
> become ubiquitous.  Though it's true, as Harvey Thompson has observed
> that "it's difficult to retro-fit correctness", Clay Shirky was also
> very right when he pointed out that "You cannot simultaneously have
> mass adoption and rigor".  If browsers in 1995 had been as pedantic as
> the browsers of 2011 (rightly) are, we wouldn't even have the Web; or
> if it existed at all it would just be a nichey thing that a few
> scientists used to make their publications available to each other.
> 
> So while I agree that in the case of HTML we are right to now be
> moving towards more rigorous demands of what to accept (as well, of
> course, as being conservative in what we emit), I don't think we could
> have made the leap from nothing to modern rigour.
> 
> -- Mike


[CODE4LIB] Win $450 for the best Personal Data Mashup!

2011-04-08 Thread Jodi Schneider
Of possible interest. -Jodi

Begin forwarded message:

> From: Laura Dragan 
> Date: 8 April 2011 13:11:19 GMT+01:00
> To: deri.ie-resea...@lists.deri.org
> Subject: [Deri.ie-research] Win 450USD for the best Personal Data Mashup!
> 
> Personal Data Mashup Challenge
>   http://semanticweb.org/wiki/PSD2011Challenge
> 
> 
> Your computer is overflowing with applications for managing your data: 
> your photos, your documents, your calendar, your email, etc. On the 
> other side of your Internet connection, the web is overflowing with 
> services for creating, managing, and sharing many of the same things.
> 
> We believe that Semantic Technology can be used for linking, 
> categorizing and combining the data from all these sources, giving you 
> an overall view that no single application can match.
> 
> If you agree - come to PSD2011 [1] - show us how and get rich(*) and 
> famous!
> 
> The challenge prize is kindly sponsored by the Open Semantic 
> Collaboration Architecture Foundation (OSCAF) [2].
> 
> Submission deadline is 1st June 2011. The winner will be announced 
> on the 26th June, at PSD2011.
> 
> 
> Regards, 
> The PSD2011 organizing committee
> 
> 
> [1] http://semanticweb.org/wiki/PSD2011
> [2] http://www.oscaf.org/


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Owen Stephens
Thanks for all the information and discussion.

I don't think I'm familiar enough with Authority file formats to completely
comprehend - but I certainly understand the issues around the question of
'place' vs 'histo-geo-political entity'. Some of this makes me worry about
the immediate applicability of the LC Authority files in the Linked Data
space - someone said to me recently 'SKOS is just a way of avoiding dealing
with the real semantics' :)

Anyway - putting that to one side, the simplest approach for me at the
moment seems to only look at authorised LCSH as represented on id.loc.gov.
Picking up on Andy's first response:

On Thu, Apr 7, 2011 at 3:46 PM, Houghton,Andrew  wrote:

> After having done numerous matching and mapping projects, there are some
> issues that you will face with your strategy, assuming I understand it
> correctly. Trying to match a heading starting at the left most subfield and
> working forward will not necessarily produce correct results when matching
> against the LCSH authority file. Using your example:
>
>
>
> 650 _0 $a Education $z England $x Finance
>
>
>
> is a good example of why processing the heading starting at the left will
> not necessarily produce the correct results.  Assuming I understand your
> proposal you would first search for:
>
>
>
> 150 __ $a Education
>
>
>
> and find the heading with LCCN sh85040989. Next you would look for:
>
>
>
> 181 __ $z England
>
>
>
> and you would NOT find this heading in LCSH.
>

OK - ignoring the question of where the best place to look for this is - I
can live with not matching it for now. We can revisit this later (perhaps when
I understand it better, or when these headings are added to id.loc.gov).


> The second issue using your example is that you want to find the “longest”
> matching heading. While the pieces parts are there, so is the enumerated
> authority heading:
>
>
>
> 150 __ $a Education $z England
>
>
>
> as LCCN sh2008102746. So your heading is actually composed of the
> enumerated headings:
>
>
>
> sh2008102746  150 __ $a Education $z England
>
> sh2002007885  180 __ $x Finance
>
>
>
> and not the separate headings:
>
>
>
> sh85040989    150 __ $a Education
>
> n82068148     150 __ $a England
>
> sh2002007885  180 __ $x Finance
>
>
>
> Although one could argue that either analysis is correct depending upon
> what you are trying to accomplish.
>
>
>

What I'm interested in is representing the data as RDF/Linked Data in a way
that opens up the best opportunities for both understanding and querying the
data. Unfortunately at the moment there isn't a good way of representing
LCSH directly in RDF (the MADS work may help I guess but to be honest at the
moment I see that as overly complex - but that's another discussion).

What I can do is make statements that an item is 'about' a subject (probably
using dc:subject) and then point at an id.loc.gov URI. However, if I only
express individual headings:
Education
England (natch)
Finance

Then obviously I lose the context of the full heading - so I also want to
look for
Education--England--Finance (which I won't find on id.loc.gov as not
authorised)

At this point I could stop, but my feeling is that it is useful to also look
for other combinations of the terms:

Education--England (not authorised)
Education--Finance (authorised! http://id.loc.gov/authorities/sh85041008)

My theory is that as long as I stick to combinations that start with a
topical term I'm not going to make startlingly inaccurate statements?
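
As a rough illustration of the 'about' statements described above, the
sketch below writes one dc:subject triple per matched heading in N-Triples
form. The item URI is a made-up example.org address, and the id.loc.gov URI
pattern is simply the one quoted in this message, so treat both as
assumptions.

```python
from typing import Iterable, List

DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
# URI pattern as quoted in this thread; it may not be the current one.
ID_LOC_BASE = "http://id.loc.gov/authorities/"


def subject_triples(item_uri: str, lccns: Iterable[str]) -> List[str]:
    """Return one N-Triples line per LCCN, stating the item is about it."""
    return [f"<{item_uri}> <{DC_SUBJECT}> <{ID_LOC_BASE}{lccn}> ." for lccn in lccns]


# Education (sh85040989) and Education--Finance (sh85041008), the LCCNs
# mentioned in this thread; the item URI is hypothetical.
for line in subject_triples("http://example.org/item/1", ["sh85040989", "sh85041008"]):
    print(line)
```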


> The matching algorithm I have used in the past contains two routines. The
> first f(a) will accept a heading as a parameter, scrub the heading, e.g.,
> remove unnecessary subfield like $0, $3, $6, $8, etc. and do any other
> pre-processing necessary on the heading, then call the second function f(b).
> The f(b) function accepts a heading as a parameter and recursively calls
> itself until it builds up the list LCCNs that comprise the heading. It first
> looks for the given heading when it doesn’t find it, it removes the **last
> ** subfield and recursively calls itself, otherwise it appends the found
> LCCN to the returned list and exits. This strategy will find the longest
> match.
>
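
One possible reading of the two routines described above, as a sketch only.
The description does not say what happens to the subfields left over after a
match, or to a subfield that matches nothing, so two assumptions are made
here: the remainder is matched in the same way, and an unmatched leading
subfield is skipped. The authority index is a plain dict standing in for a
real lookup, and a loop stands in for the recursive "drop the last subfield"
step.

```python
from typing import Dict, List, Tuple

Subfield = Tuple[str, str]        # (code, value), e.g. ("z", "England")
Heading = Tuple[Subfield, ...]

CONTROL_SUBFIELDS = {"0", "3", "6", "8"}


def scrub(heading: Heading) -> Heading:
    """First routine: drop control subfields and tidy the values."""
    return tuple((c, v.strip()) for c, v in heading if c not in CONTROL_SUBFIELDS)


def longest_match(heading: Heading, index: Dict[Heading, str]) -> List[str]:
    """Second routine: list the LCCNs of the longest matching pieces."""
    if not heading:
        return []
    # Try the whole heading, then drop the last subfield and try again.
    for end in range(len(heading), 0, -1):
        lccn = index.get(heading[:end])
        if lccn is not None:
            return [lccn] + longest_match(heading[end:], index)
    # Assumption: nothing matched, so skip the first subfield and carry on.
    return longest_match(heading[1:], index)


# Index built from the LCCNs quoted in this message.
INDEX: Dict[Heading, str] = {
    (("a", "Education"),): "sh85040989",
    (("a", "Education"), ("z", "England")): "sh2008102746",
    (("x", "Finance"),): "sh2002007885",
}

# The $0 value is a placeholder, present only to show scrubbing.
heading = (("a", "Education"), ("z", "England"), ("x", "Finance"), ("0", "placeholder"))
print(longest_match(scrub(heading), INDEX))  # ['sh2008102746', 'sh2002007885']
```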

Unless I've misunderstood this, this strategy would not find
'Education--Finance'? Instead I need to remove each *subdivision* in turn
(no matter where it appears in the heading order) and try all possible
combinations checking each for a match on id.loc.gov. Again, I can do this
without worrying about possible invalid headings, as these wouldn't have
been authorised anyway...

I can check the number of variations around this but I guess that in my
limited set of records (only 30k) there will be a relatively small number of
possible patterns to check.

Does that make sense?
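
A minimal sketch of the "try every combination" idea above: keep the leading
topical term, then generate every subset of the subdivisions in their
original order. The function name and the "--" joining convention are
assumptions for illustration; each candidate would then be checked against
id.loc.gov.

```python
from itertools import combinations
from typing import List, Sequence


def candidate_headings(main: str, subdivisions: Sequence[str]) -> List[str]:
    """Return the main term with every ordered subset of its subdivisions,
    longest candidates first."""
    candidates = []
    for size in range(len(subdivisions), -1, -1):
        # combinations() keeps the elements in their original order.
        for combo in combinations(subdivisions, size):
            candidates.append("--".join((main,) + combo))
    return candidates


print(candidate_headings("Education", ["England", "Finance"]))
# ['Education--England--Finance', 'Education--England',
#  'Education--Finance', 'Education']
```

A heading with n subdivisions produces 2**n candidates, so the number of
lookups stays small for typical headings with two or three subdivisions.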


[CODE4LIB] Mapping vocabularies (was: LCSH and Linked Data)

2011-04-08 Thread Jakob Voss

Hi,

Any transformation of a controlled vocabulary, either in format (MARC to 
RDF) or in coverage (e.g. from LCSH to DDC, MeSH, GND, etc.) has to 
decide whether


(a) there is a one-to-one (or one-to-zero) mapping between all concepts
(b) you need n-to-m or even more complex mappings

Mapping name authority files in VIAF was an instance of (a), because we more 
or less agree that a person is always the same person.

Mapping authority data in MARC from different institutions, however, looks 
like an instance of (b). Not only are concepts like "England" fuzzier 
than people, but they are also used in different contexts for different 
purposes, depending on the cataloging rules and their specific 
interpretation. It does not help to argue about MARC fields because there 
just is no easy one-to-one mapping between, for instance:


- The Kingdom of England (927–1707)
- The area of the Kingdom of England (927–1707)
- The country England as it is today
- The area of England including the Principality of Sealand
- The area of England excluding the Principality of Sealand
- The whole Island Great Britain
- The Island Great Britain including Ireland
- The Island Great Britain including Northern Ireland
- The Kingdom of Great Britain (1707 to 1801)
- The United Kingdom of Great Britain and Ireland (1801 to 1922)
- etc.

I gave a talk about the fruitless attempt to put reality in terms of 
Semantic Web at Wikimania 2007 (starting with slide 12):

http://www.slideshare.net/NCurse/jakob-voss-wikipedia2007

Instead of discussing how to map terms and concepts "the right way" you 
should think about how to express fuzzy and complex mappings. The SKOS 
mapping vocabulary provides some relations for this purpose. I can also 
recommend the DC2010 paper "Establishing a Multi-Thesauri-Scenario based 
on SKOS and Cross-Concordances" by Mayr, Zapilko, and Sure:

http://dcpapers.dublincore.org/ojs/pubs/article/viewArticle/1031
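
As a small illustration of what expressing a fuzzy mapping can look like, the
sketch below writes SKOS mapping statements as N-Triples. The five property
names are the mapping relations defined by SKOS; the concept URIs are made-up
example.org addresses, and which relation fits which pair of concepts is
exactly the judgement call discussed above.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"

# The SKOS mapping relations, from strongest to loosest claim of equivalence,
# plus the two hierarchical ones.
MAPPING_RELATIONS = {
    "exactMatch", "closeMatch", "broadMatch", "narrowMatch", "relatedMatch",
}


def mapping_triple(source: str, relation: str, target: str) -> str:
    """Return one N-Triples line linking two concepts with a SKOS mapping."""
    if relation not in MAPPING_RELATIONS:
        raise ValueError(f"not a SKOS mapping relation: {relation}")
    return f"<{source}> <{SKOS}{relation}> <{target}> ."


# Hypothetical URIs: "England" as a place in one vocabulary is only close to,
# not identical with, "Kingdom of England" as a body in another.
print(mapping_triple(
    "http://example.org/vocab-a/England",
    "closeMatch",
    "http://example.org/vocab-b/KingdomOfEngland",
))
```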

If you do not want to bother with complex mappings but prefer 
one-to-one, you should not talk about differences like England as 
corporate body, England as place, or England as nationality, etc.


Sure you can put all these meanings into a broad and fuzzy term 
"England" but then stop complaining about semantic differences and use 
the term as unqualified subject heading with no specific meaning for 
anything that is related to any of the many ideas that anyone can call 
"England". This is the way that full text retrieval works.


You just can't have both simple mappings and precise terms.

Jakob

--
Jakob Voß , skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] LCSH and Linked Data

2011-04-08 Thread Till Kinstler
On 07.04.2011 17:44, Ford, Kevin wrote:

> Actually, it appears to depend on whose Authority record you're looking at.  
> The Canadians, Australians, and Israelis have it as a CorporateName (110), as 
> do the French (210 - unimarc); LC and the Germans say it's a Geographic Name.

No, the original "England" record linked to VIAF in the German GND says
it is a "Gebietskörperschaft", which is a corporate body in English.
See http://d-nb.info/gnd/15138-5/about/html and the RDF representation
at http://d-nb.info/gnd/15138-5/about/rdf
Perhaps something went "wrong" in the mapping of the German authority
record to MARC21, so "England" got into the 151 (or there might be good
reasons to do it that way, ask metadata experts...). The original record
is not maintained in MARC21, we don't do MARC21 (or any MARC at all)
here, we are just starting to switch to it as future(!) exchange
format... :-).
Sorry for being pedantic, early morning and not enough coffee yet...

Till

-- 
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de