Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-05-03 Thread Eric Hellman

I'll try to find out.

Sent from Eric Hellman's iPhone


On May 2, 2010, at 4:10 PM, stuart yeates stuart.yea...@vuw.ac.nz  
wrote:


But the interesting use case isn't OpenURL over HTTP, the  
interesting use case (for me) is OpenURL on a disconnected eBook  
reader resolving references from one ePub to other ePub content on  
the same device. Can OpenURL be used like that?


Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-05-03 Thread Jonathan Rochkind
Here is the API response Umlaut provides to OpenURL requests with 
standard scholarly formats.  This API response is of course to some 
extent customized to Umlaut's particular context/use cases; it was not 
necessarily intended to be any kind of standard -- certainly not with as 
wide-ranging an intended domain as OpenURL 1.0 (which never really 
caught on). It's targeted at standard, actually-existing link resolver 
use cases in the scholarly environment.


But, here you go, live even:

http://findit.library.jhu.edu/resolve/api?sid=google&auinit=AB&aulast=Miller&atitle=Reporting+results+of+cancer+treatment&id=doi:10.1002/1097-0142%2819810101%2947:1%3C207::AID-CNCR2820470134%3E3.0.CO%3B2-6&title=Cancer&volume=47&issue=1&date=2006&spage=207&issn=0008-543X

JSON is also available. Note that complete results do not necessarily 
show up at first; some information is still being loaded in the 
background.  You can refresh the URL to see more results, and you'll know 
the back-end server has nothing left to give you when 
<complete>true</complete> is present.
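
A quick way to poll that from a script would be something like the
following (an untested Python sketch; the <complete> element name and
placement are just my reading of the response described above, so adjust
to whatever the API actually returns):

import time
import urllib.request
import xml.etree.ElementTree as ET

API_URL = ("http://findit.library.jhu.edu/resolve/api?"
           "sid=google&aulast=Miller&title=Cancer&volume=47"
           "&issue=1&spage=207&issn=0008-543X")  # trimmed example query

def fetch_until_complete(url, attempts=5, delay=2.0):
    """Re-request the resolver API until it reports that all results are in."""
    tree = None
    for _ in range(attempts):
        with urllib.request.urlopen(url) as resp:
            tree = ET.fromstring(resp.read())
        # Look for a 'complete' element anywhere, ignoring namespaces.
        done = any(el.tag.rsplit('}', 1)[-1] == 'complete'
                   and (el.text or '').strip().lower() == 'true'
                   for el in tree.iter())
        if done:
            return tree
        time.sleep(delay)  # background sources still loading; try again
    return tree            # give back whatever we have after the last try

# tree = fetch_until_complete(API_URL)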


Another response with embedded HTML snippets is also available (in 
both XML and JSON):


http://findit.library.jhu.edu/resolve/partial_html_sections?sid=google&auinit=AB&aulast=Miller&atitle=Reporting+results+of+cancer+treatment&id=doi:10.1002/1097-0142%2819810101%2947:1%3C207::AID-CNCR2820470134%3E3.0.CO%3B2-6&title=Cancer&volume=47&issue=1&date=2006&spage=207&issn=0008-543X


Ross Singer wrote:

On Fri, Apr 30, 2010 at 10:08 AM, Eric Hellman e...@hellman.net wrote:
  

OK, what does the EdSuRoSi spec for OpenURL responses say?



Well, I don't think it's up to us and I think it's dependent upon
community profile (more than Z39.88 itself), since it would be heavily
influenced by what is actually trying to be accomplished.

I think the basis of a response could actually be another context
object with the 'services' entity containing a list of
services/targets that are formatted in some way that is appropriate
for the context and the referent entity enhanced with whatever the
resolver can add to the puzzle.

This could then be taken to another resolver for more services layered on.

This is just riffing off the top of my head, of course...
-Ross.

  


Re: [CODE4LIB] SRU/ZeeRex explain question : record schemas

2010-05-03 Thread Jonathan Rochkind

Thanks Ray, I believe it is!

A schema listed there is available for requesting with the 
recordSchema= parameter, yes? Cool, that's exactly what I was looking 
for.
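
So if I've got this right, a request that asks for records in one of
those advertised schemas would look something like this (a Python sketch
against a hypothetical endpoint, using the standard SRU searchRetrieve
parameters):

from urllib.parse import urlencode

BASE = "http://example.org/sru"          # hypothetical SRU endpoint
params = {
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": 'dc.title = "moby dick"',   # a CQL query
    "recordSchema": "marcxml",           # short name advertised in schemaInfo
    "maximumRecords": "5",
}
print(BASE + "?" + urlencode(params))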


Another question though. I note when looking up schemaInfo... I'm a bit 
confused by the sort attribute.  How could you sort by a schema? What 
is this attribute actually for? 


Jonathan

Ray Denenberg, Library of Congress wrote:

schemaInfo is what you're looking for I think.

Look at http://z3950.loc.gov:7090/voyager.

Line 74, for example,
<schemaInfo>
  <schema identifier="info:srw/schema/1/marcxml-v1.1" sort="false"
          name="marcxml">
    <title>MARCXML</title>
  </schema>


Is this what you're looking for?

--Ray


- Original Message - 
From: Jonathan Rochkind rochk...@jhu.edu

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Friday, April 30, 2010 3:57 PM
Subject: [CODE4LIB] SRU/ZeeRex explain question : record schemas


  

This page:
http://www.loc.gov/standards/sru/resources/schemas.html

says:

The Explain document lists the XML schemas for a given database in which 
records may be transferred. Every schema is unambiguously identified by a 
URI and a server may assign a short name, which may or may not be the same 
as the short name listed in the table below (and may differ from the short 
name that another server assigns).



But perusing the SRU/ZeeRex Explain documentation I've been able to find, 
I've been unable to find WHERE in the Explain document this information is 
listed/advertised.


Can anyone clue me in? 



  


Re: [CODE4LIB] It's cool to love milk and cookies

2010-05-03 Thread Kevin S. Clarke
me too.

On Sun, May 2, 2010 at 9:23 PM, Rosalyn Metz rosalynm...@gmail.com wrote:
 I like oreo double stuff. I take one cookie off each sandwich and then take
 two sides with cream and sandwich them together. Voila. Oreo quadruple
 stuff.

 On May 2, 2010 4:05 PM, Michael J. Giarlo leftw...@alumni.rutgers.edu
 wrote:

 EMACS

 -Mike

 On Sun, May 2, 2010 at 14:12, Mark Pernotto mark.perno...@gmail.com wrote:
 I like heavy whipp...



Re: [CODE4LIB] SRU/ZeeRex explain question : record schemas

2010-05-03 Thread Ray Denenberg, Library of Congress

From: Jonathan Rochkind rochk...@jhu.edu
Another question though. I note when looking up schemaInfo... I'm a bit 
confused by the sort attribute.  How could you sort by a schema? What is 
this attribute actually for?



Well, indulge me; this is best explained by the current OASIS SRU draft.

(The current and earlier specs don't do a good job here. But for background 
if interested:  sorting as an SRU function was supported in SRU 1.1 and 
taken out of version 1.2, replaced by sorting as a function of the query 
language rather than the protocol. For the OASIS work it's in both.  For the 
current spec at LC, which reflects 1.2, the attribute doesn't even make 
sense. If you go back to the 1.1 archive it does. Still, the OASIS document 
treats it more clearly.)


See http://www.loc.gov/standards/sru/oasis/sru-2-0-draft-most-current.doc 
See section 9.1.


So essentially, when you sort in SRU, you provide an XPath expression.  The 
XPath expression is meaningful in the context of a schema, but the *record 
schema* may not be the most meaningful schema for purposes of sorting; there 
may be another, more meaningful schema.  So, you have the capability to 
specify not only a record schema but an auxiliary sort schema.


A given schema that an Explain file lists will usually be one that is used 
as a record schema, but it may also be usable as a sort schema.   That's 
what the sort attribute tells you.
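
To make that concrete (and this is only an illustration -- the precise
sortKeys syntax differs between SRU 1.1 and the OASIS draft, see section
9.1), a request might name both a record schema and a separate sort
schema, roughly like this Python sketch against a hypothetical endpoint;
the comma-separated "xpath,schema" form below is my assumption, not a
quotation from the spec:

from urllib.parse import urlencode

params = {
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": "dinosaur",
    "recordSchema": "marcxml",      # schema the records come back in
    # XPath plus the schema it should be interpreted against (assumed form)
    "sortKeys": "/record/title,info:srw/schema/1/dc-v1.1",
}
print("http://example.org/sru?" + urlencode(params))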


--Ray 


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Riley, Jenn
Hi MJ,

 - for that matter, is there a good example of how to properly
 serialize DCTERMS for eg. a converted MARC/MODS record in XML (or
 RDF/XML)?  I see, eg. http://dublincore.org/documents/dcq-rdf-xml/
 which has been replaced by http://dublincore.org/documents/dc-rdf/
 but I'm not sure if the latter obviates the former entirely?  Also, the
 examples at the bottom of the latter don't show, eg. repeated elements
 or DCMES elements.  Do we abandon http://purl.org/dc/elements/1.1/
 entirely?

This has always been ridiculously confusing! Here's my understanding (though 
anyone else, please chime in and correct me if I've misunderstood):

- With the maturation of the DCMI Abstract Model 
http://dublincore.org/documents/abstract-model/, new bindings were needed to 
express features of the model not obvious in the old RDF, XML, and XHTML 
bindings.

- For RDF, http://dublincore.org/documents/dc-rdf/ is stable and fully 
intended to replace http://dublincore.org/documents/dcq-rdf-xml/.

- For XML (the non-RDF sort), the most current document is 
http://dublincore.org/documents/dc-ds-xml/, though note its status is still 
(after 18 months) only a proposed recommendation. This document itself replaces 
a transition document http://dublincore.org/documents/2006/05/29/dc-xml/ from 
2006 that never got beyond Working Draft status. To get a stable XML binding, 
you have to go all the way back to 2003 
http://dublincore.org/documents/dc-xml-guidelines/index.shtml, a binding 
which predates much of the current DCMI Abstract Model.

- Many found the 2003 XML binding unsatisfactory in that it prescribed the 
format for individual dc and dcterms properties, but not a full XML format - 
that is, there was no DC-sanctioned XML root element for a qualified DC 
record. (This gets at the very heart of the difference in perspective between 
RDF and XML, properties and elements, etc., I think, but I digress...) The 
folks I'm aware of that developed workarounds for this were those sharing QDC 
over OAI-PMH. I find the UIUC OAI registry 
http://oai.grainger.uiuc.edu/registry/ helpful for investigations of this 
sort. A quick glance at their report on Distinct Metadata Schemas used in 
OAI-PMH data providers http://oai.grainger.uiuc.edu/registry/ListSchemas.asp 
seems to suggest that CONTENTdm uses this schema for QDC 
http://epubs.cclrc.ac.uk/xsd/qdc.xsd and DSpace uses this one 
http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd. The latter 
doesn't actually define a root element either, but since here at least 
the QDC is inside the wrappers the OAI-PMH response requires, it's 
well-formed. What someone does with that once they get it and unpack it, I 
don't know, since without a container it won't be well-formed XML. The former 
goes through several levels of importing other things and eventually ends up 
importing from an .xsd on the Dublin Core site, but they define a root element 
themselves along the way. (I think.)

- So what does one do? I guess it depends on who your target consumers of this 
data are. If you're looking to work with more traditional library environments, 
perhaps those that are using CONTENTdm, etc. the legacy hack-ish format might 
be the best. (I'm part of an initiative to revitalize the Sheet Music 
Consortium http://digital.library.ucla.edu/sheetmusic/ and lots of our 
potential contributors are CONTENTdm users, so I think this is the direction 
I'm going to take that project.) But if you're wanting to talk to DCMI-style 
folks, the dc-ds-xml, or more likely the dc-rdf option seems more attractive. 
I'm afraid I'm not much help with the implementation details of dc-rdf, though. 
One of the DC mailing lists would be, though, I suspect. There are a lot of 
active members there.

Ick, huh? :-)

Jenn


Jenn Riley
Metadata Librarian
Digital Library Program
Indiana University - Bloomington
Wells Library W501
(812) 856-5759
www.dlib.indiana.edu

Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com


Re: [CODE4LIB] SRU/ZeeRex explain question : record schemas

2010-05-03 Thread Jonathan Rochkind

This makes some amount of sense, thanks.

I actually kind of liked the sorting as part of CQL in SRU 1.2.  I see 
how XPath sorting can be convenient too.


But you will leave sorting as part of CQL too in any changes to the CQL 
specs, I hope?  I think CQL has a lot of use even outside of SRU proper, 
so I encourage you to leave its spec not too tightly coupled to SRU. 

I think there are at least three ways to sort as part of (different 
versions of?) SRU now! (A quick sketch contrasting the first two follows 
below.)


1) An actual separate sortKeys query parameter
2) Included in the CQL expression in query, using the sortBy keyword.
3) In a draft not yet finalized, OASIS/SRU 2.0 methods of specifying XPaths 
for sorting.  [Thanks for including the link to the current SRU 2.0 
draft; I didn't know that was publicly available anywhere -- it's not 
really googlable.]
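
Here's the sketch (Python; the index names and the exact CQL sortBy
syntax are my assumptions, not checked against any server, and the
endpoint is made up):

from urllib.parse import urlencode

base = "http://example.org/sru?"

# 1) Sorting via a separate sortKeys request parameter (SRU 1.1 style).
style_1 = urlencode({
    "operation": "searchRetrieve",
    "version": "1.1",
    "query": "dc.title = whales",
    "sortKeys": "dc.date",
})

# 2) Sorting embedded in the CQL query itself with sortBy (CQL 1.2 style).
style_2 = urlencode({
    "operation": "searchRetrieve",
    "version": "1.2",
    "query": "dc.title = whales sortBy dc.date/sort.descending",
})

print(base + style_1)
print(base + style_2)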


Do I have this right?  As SRU 1.2 is the only actual spec I have to work 
with... am I right that either top-level sortKeys, or embedded in CQL 
with sortBy would both be legal in SRU 1.2 (whether a given server 
supports one or both of them is a different question -- but they are 
both legal to spec, yes?).


I'd actually strongly encourage you to leave both of them as legal to 
spec in SRU 2.0; they make things much simpler to work with (although 
also less flexible; that's generally the trade-off) than requiring 
XPaths to be specified, especially when a corpus being searched may 
include records in diverse, varied, and inconsistent record schemas.


Jonathan

Ray Denenberg, Library of Congress wrote:

From: Jonathan Rochkind rochk...@jhu.edu
  
Another question though. I note when looking up schemaInfo... I'm a bit 
confused by the sort attribute.  How could you sort by a schema? What is 
this attribute actually for?




Well, indulge me; this is best explained by the current OASIS SRU draft.

(The current and earlier specs don't do a good job here. But for background 
if interested:  sorting as an SRU function was supported in SRU 1.1 and 
taken out of version 1.2, replaced by sorting as a function of the query 
language rather than the protocol. For the OASIS work it's in both.  For the 
current spec at LC, which reflects 1.2, the attribute doesn't even make 
sense. If you go back to the 1.1 archive it does. Still, the OASIS document 
treats it more clearly.)


See http://www.loc.gov/standards/sru/oasis/sru-2-0-draft-most-current.doc 
See section 9.1.


So essentially, when you sort in SRU, you provide an XPath expression.  The 
XPath expression is meaningful in the context of a schema, but the *record 
schema* may not be the most meaningful schema for purposes of sorting; there 
may be another, more meaningful schema.  So, you have the capability to 
specify not only a record schema but an auxiliary sort schema.


A given schema that an Explain file lists will usually be one that is used 
as a record schema, but it may also be usable as a sort schema.   That's 
what the sort attribute tells you.


--Ray 

  


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Jonathan Rochkind
I'm still confused about all this stuff too, but I've often seen the 
oai_dc format (for OAI-PMH, I think?) used as a 'standard' way to expose 
simple DC attributes.


One thing I was confused about was whether the oai_dc format _required_ 
the use of the old-style DC URIs, or also allowed the use of the 
DCTERMS URIs?   Anyone know?  I kind of think it actually requires the 
old-style DC URIs, as it was written before dcterms. 

At least it is one standardized way to expose the old basic DC elements, 
with a specific XML schema.
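
For what it's worth, a minimal oai_dc record looks something like the
output of this little Python sketch -- as far as I can tell the oai_dc
schema binds the fifteen elements to the old
http://purl.org/dc/elements/1.1/ namespace rather than dcterms, but
treat that reading as my assumption, per the question above:

import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"

ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

record = ET.Element("{%s}dc" % OAI_DC)
ET.SubElement(record, "{%s}title" % DC).text = "Moby Dick"
ET.SubElement(record, "{%s}creator" % DC).text = "Melville, Herman"
ET.SubElement(record, "{%s}date" % DC).text = "1851"

print(ET.tostring(record, encoding="unicode"))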


Jonathan

Riley, Jenn wrote:

Hi MJ,

  

- for that matter, is there a good example of how to properly
serialize DCTERMS for eg. a converted MARC/MODS record in XML (or
RDF/XML)?  I see, eg. http://dublincore.org/documents/dcq-rdf-xml/
which has been replaced by http://dublincore.org/documents/dc-rdf/
but I'm not sure if the latter obviates the former entirely?  Also, the
examples at the bottom of the latter don't show, eg. repeated elements
or DCMES elements.  Do we abandon http://purl.org/dc/elements/1.1/
entirely?



This has always been ridiculously confusing! Here's my understanding (though 
anyone else, please chime in and correct me if I've misunderstood):

- With the maturation of the DCMI Abstract Model 
http://dublincore.org/documents/abstract-model/, new bindings were needed to 
express features of the model not obvious in the old RDF, XML, and XHTML bindings.

- For RDF, http://dublincore.org/documents/dc-rdf/ is stable and fully intended to 
replace http://dublincore.org/documents/dcq-rdf-xml/.

- For XML (the non-RDF sort), the most current document is 
http://dublincore.org/documents/dc-ds-xml/, though note its status is still (after 18 
months) only a proposed recommendation. This document itself replaces a transition document 
http://dublincore.org/documents/2006/05/29/dc-xml/ from 2006 that never got beyond 
Working Draft status. To get a stable XML binding, you have to go all the way back to 2003 
http://dublincore.org/documents/dc-xml-guidelines/index.shtml, a binding which predates 
much of the current DCMI Abstract Model.

- Many found the 2003 XML binding unsatisfactory in that it prescribed the format for individual dc and dcterms 
properties, but not a full XML format - that is, there was no DC-sanctioned XML root element for a qualified DC 
record. (This gets at the very heart of the difference in perspective between RDF and XML, properties 
and elements, etc., I think, but I digress...) The folks I'm aware of that developed workarounds for this were 
those sharing QDC over OAI-PMH. I find the UIUC OAI registry http://oai.grainger.uiuc.edu/registry/ 
helpful for investigations of this sort. A quick glance at their report on Distinct Metadata Schemas used in 
OAI-PMH data providers http://oai.grainger.uiuc.edu/registry/ListSchemas.asp seems to suggest that 
CONTENTdm uses this schema for QDC http://epubs.cclrc.ac.uk/xsd/qdc.xsd and DSpace uses this one 
http://dublincore.org/schemas/xmls/qdc/2006/01/06/dcterms.xsd. The latter doesn't actually define a root 
element either, but since here at least the QDC is inside the wrappers the OAI-PMH response requires, it's 
well-formed. What someone does with that once they get it and unpack it, I 
don't know, since without a container it won't be well-formed XML. The former 
goes through several levels of importing other things and eventually ends up 
importing from an .xsd on the Dublin Core site, but they define a root element 
themselves along the way. (I think.)

- So what does one do? I guess it depends on who your target consumers of this data 
are. If you're looking to work with more traditional library environments, perhaps 
those that are using CONTENTdm, etc. the legacy hack-ish format might be the best. 
(I'm part of an initiative to revitalize the Sheet Music Consortium 
http://digital.library.ucla.edu/sheetmusic/ and lots of our potential 
contributors are CONTENTdm users, so I think this is the direction I'm going to take 
that project.) But if you're wanting to talk to DCMI-style folks, the dc-ds-xml, or 
more likely the dc-rdf option seems more attractive. I'm afraid I'm not much help 
with the implementation details of dc-rdf, though. One of the DC mailing lists would 
be, though, I suspect. There are a lot of active members there.

Ick, huh? :-)

Jenn


Jenn Riley
Metadata Librarian
Digital Library Program
Indiana University - Bloomington
Wells Library W501
(812) 856-5759
www.dlib.indiana.edu

Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com

  


[CODE4LIB] SRU/ZeeRex explain question : CQL version?

2010-05-03 Thread Jonathan Rochkind
Ah, I think I was wrong below. I must have been looking at different 
versions of the SRU spec without realizing it.


SRU 1.1 includes a sortKeys parameter, and CQL 1.1 does not include a 
sortBy clause.


SRU 1.2 does NOT include a sortKeys parameter, and CQL 1.2 does 
include a sortBy clause.


Okay, in the SRU/ZeeRex explain document, how do you advertise which 
version of CQL you support, 1.1 or 1.2?  Or is this just implied by 
which version of SRU you support, 1.1 or 1.2?   How do you advertise 
THAT in an SRU/ZeeRex explain?


Jonathan

Jonathan Rochkind wrote:
I think there are at least three ways to sort as part of (different 
versions of?) SRU now! 


1) An actual separate sortKeys query parameter
2) Included in the CQL expression in query, using the sortBy keyword.
3) In draft not finalized, OASIS/SRU 2.0 methods of specifying XPaths 
for sorting.  [Thanks for including the link to the current SRU 2.0 
draft; I didn't know that was publicly available anywhere -- it's not 
really googlable].


  


Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

2010-05-03 Thread Jakob Voss

Hi Stuart,

These have been included because they are in widespread use in a current 
written culture. The problems I personally have are down to characters 
used by a single publisher in a handful of books more than a hundred 
years ago. Such characters are explicitly excluded from Unicode.


In the early period of the standardisation of the Māori language there
were several competing ideas of what to use as a character set. One of
those included a 'wh' ligature as a character. Several works were
printed using this ligature. This ligature does not qualify for
inclusion in Unicode.


That is a matter for discussion. If you do not call it a 'ligature', the 
chances of getting it included are higher.



To see how we handle the text, see:

http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

The underlying representation is TEI/XML, which has a mechanism to
handle such glyphs. The things I'm still unhappy with are:

* getting reasonable results when users cut-n-paste the text/image HTML
combination to some other application
* some browsers still like line-breaking on images in the middle of words


That's interesting and reminds me of the treatment of mathematical 
formulae in journal titles, which mostly end up as ugly images.


In Unicode you are allowed to assign private characters

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters

The U+200D ZERO WIDTH JOINER could also be used but most browsers will 
not support it - you need a font that supports your character anyway.


http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx

In summary: Unicode is just a subset of all characters that have been 
used for written communication, and whether a character gets included 
depends not only on objective properties but also on lobbying and other 
circumstances. The deeper you dig, the nastier Unicode gets - as with all 
complex formats and standards.


Cheers
Jakob

P.S: Michael Kaplan's  blog also contains a funny article about emoji: 
http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] It's cool to love milk and cookies

2010-05-03 Thread Joe Hourcle
You know, there are some of us who are milk intolerant on this mailing 
list.


And emacs intolerant, too.  (although, I did use 'ee' as my editor in elm, 
but elm took too long to support MIME, so I switched to pine, with their 
pico default editor, but I don't use any of those I mentioned for coding, 
even though I am in pico/pine right now, as I still haven't switched to 
alpine or mutt)


-Joe


Re: [CODE4LIB] It's cool to love milk and cookies

2010-05-03 Thread Ross Singer
But is there a NISO standard for this?

On Fri, Apr 30, 2010 at 7:13 PM, Simon Spero s...@unc.edu wrote:
 I like chocolate milk.



Re: [CODE4LIB] It's cool to love milk and cookies

2010-05-03 Thread Aaron Rubinstein

C-u 2 double-stuff

Aaron

On 5/2/2010 9:23 PM, Rosalyn Metz wrote:

I like oreo double stuff. I take one cookie off each sandwich and then take
two sides with cream and sandwich them together. Voila. Oreo quadruple
stuff.

On May 2, 2010 4:05 PM, Michael J. Giarloleftw...@alumni.rutgers.edu
wrote:

EMACS

-Mike

On Sun, May 2, 2010 at 14:12, Mark Pernottomark.perno...@gmail.com  wrote:

I like heavy whipp...


Re: [CODE4LIB] It's cool to love milk and cookies

2010-05-03 Thread Jay Luker
I believe there is an organization called NABISCO that is working on one.

--jay

On Mon, May 3, 2010 at 10:40 AM, Ross Singer rossfsin...@gmail.com wrote:
 But is there a NISO standard for this?

 On Fri, Apr 30, 2010 at 7:13 PM, Simon Spero s...@unc.edu wrote:
 I like chocolate milk.




Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

2010-05-03 Thread Jonathan Rochkind
Hmm, you could theoretically assign chars in the private Unicode area to 
the chars you need -- but then have your application replace those chars 
with small images on rendering/display.


This seems as clean a solution as you are likely to find. Your TEI 
solution still requires chars-as-images for these unusual chars, right?  
So this is no better with regard to copying-and-pasting, browser 
display,  and general interoperability than your TEI solution, but no 
worse either -- it's pretty much the same thing. But it may be better in 
terms of those considerations for chars that actually ARE currently 
unicode codepoints.


If any of your private chars later become non-private unicode 
codepoints, you could always globally replace your private codepoints 
with the new standard ones.
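
In code, the whole approach is pretty trivial -- something like this
Python sketch, where the codepoint, image path, and eventual standard
character are all made up for illustration:

PUA_WH = "\ue000"   # arbitrary Private Use Area codepoint standing in for 'wh'

def to_display_html(text):
    """Swap the private character for an inline image at rendering time."""
    return text.replace(PUA_WH, '<img src="/glyphs/wh-ligature.png" alt="wh"/>')

def migrate(text, standard_char):
    """If the character is ever standardized, replace the PUA codepoint globally."""
    return text.replace(PUA_WH, standard_char)

stored = "Te " + PUA_WH + "are"   # text as stored, using the PUA codepoint
print(to_display_html(stored))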


With 137K private codepoints available, you _probably_ wouldn't run 
out. I think.  You could try standardizing these private codepoints 
among people in similar contexts/communities to you and your needs -- it 
looks like there are several existing efforts to document shared uses of 
private codepoints for chars that do not have official unicode 
codepoints. They are mentioned in the wikipedia article. 

[Reading that wikipedia article taught me something new I didn't know 
about Marc21 and unicode too -- a topic generally on top of my pile 
these days -- The MARC 21 standard uses the [Private Use Area] to 
encode East Asian characters present in MARC-8 that have no Unicode 
encoding. Who knew? ]


Jonathan

Jakob Voss wrote:

Hi Stuart,

  
These have been included because they are in widespread use in a current 
written culture. The problems I personally have are down to characters 
used by a single publisher in a handful of books more than a hundred 
years ago. Such characters are explicitly excluded from Unicode.


In the early period of the standardisation of the Māori language there
were several competing ideas of what to use as a character set. One of
those included a 'wh' ligature as a character. Several works were
printed using this ligature. This ligature does not qualify for
inclusion in Unicode.



That is a matter for discussion. If you do not call it a 'ligature', the 
chances of getting it included are higher.


  

To see how we handle the text, see:

http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

The underlying representation is TEI/XML, which has a mechanism to
handle such glyphs. The things I'm still unhappy with are:

* getting reasonable results when users cut-n-paste the text/image HTML
combination to some other application
* some browsers still like line-breaking on images in the middle of words



That's interesting and reminds me of the treatment of mathematical 
formulae in journal titles, which mostly end up as ugly images.


In Unicode you are allowed to assign private characters

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters

The U+200D ZERO WIDTH JOINER could also be used but most browsers will 
not support it - you need a font that supports your character anyway.


http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx

In summary: Unicode is just a subset of all characters that have been 
used for written communication, and whether a character gets included 
depends not only on objective properties but also on lobbying and other 
circumstances. The deeper you dig, the nastier Unicode gets - as with all 
complex formats and standards.


Cheers
Jakob

P.S: Michael Kaplan's  blog also contains a funny article about emoji: 
http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx


  


[CODE4LIB] [job posting] Systems Programmer, University of Michigan Library IT

2010-05-03 Thread Cory Snavely
The University of Michigan Library is looking for a talented,
resourceful systems programmer to develop and maintain software systems.
A principal activity at the library is the development of a massive
digital archiving infrastructure to support our scanning partnership
with Google; the archive currently contains nearly 6 million items (220
TB) and is projected to grow to over 10 million items (400 TB) over the
duration of the project. Programming projects will initially consist of
enhancing the systems that receive and manage images from Google
(including substantial work with validating incoming data and diagnosing
data problems), large-scale transformation of textual and image data,
designing/developing core digital library infrastructure, and monitoring
reliability and performance of services. Projects may include server and
storage administration, depending on candidate interest and ability.
Other tasks will vary but include, for example, preparing documentation
and monitoring technology trends.

BACKGROUND:
The Library Information Technology (LIT) division provides comprehensive
technology support and guidance for the University Library system,
including hosting digital library collections, coordinating electronic
publishing initiatives, and supporting traditional library services
(circulation of materials and management of metadata).

The Core Services unit of LIT concentrates on server infrastructure,
systems integration, and automation of workflows for the library system.
Core Services undertakes projects in a number of technology areas,
including (for example) server deployment and administration,
automation, access control systems used daily by the University
community, and distributed systems that manage the flow of millions of
scanned page images per week. 

Core Services operates a growing server infrastructure based primarily
on Linux, but partially on Solaris, consisting of approximately 80
servers and over 800 TB of storage spread across three data centers.

DEPARTMENT QUALIFICATIONS:
Minimum: Bachelor's degree in computer science or an equivalent
combination of education and experience; demonstrated programming
abilities in any applicable language; strong analytical and
troubleshooting skills; excellent verbal and written communication
skills.

Desired: Demonstrated expertise with DAS, NAS, and SAN storage systems;
demonstrated experience in Linux/Solaris administration; demonstrated
experience in database administration; demonstrated experience with
developing XSLT transformations.

NOTE: This is a 2-year term position.

NOTE: Salary dependent on education and previous relevant experience.

TO APPLY:
Apply online by Monday, May 17 using the University of Michigan Jobs
website at http://www.umich.edu/jobs . This position is posted as number
39327, and can be found by searching for the keyword google.


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Ross Singer
Out of curiosity, what is your use case for turning this into DC?
That might help those of us that are struggling to figure out where to
start with trying to help you with an answer.

-Ross.

On Mon, May 3, 2010 at 11:46 AM, MJ Suhonos m...@suhonos.ca wrote:
 Thanks for your comments, guys.  I was beginning to think the lack of 
 response indicated that I'd asked something either heretical or painfully 
 obvious.  :-)

 That's my understanding as well. oai_dc predates the defining of the 15 
 legacy DC properties in the dcterms namespace, and it's my guess nobody saw 
 a reason to update the oai_dc definition after this happened.

 This is at least part of my use case — we do a lot of work with OAI on both 
 ends, and oai_dc is pretty limited due to the original 15 elements.  My 
 thinking at this point is that there's no reason we couldn't define something 
 like oai_dcterms and use the full QDC set based on the updated profile.  
 Right?

 FWIW, I'm not limited to any legacy ties; in fact, my project is aimed at 
 pushing the newer, DC-sanctioned ideas forward, so I suspect in my case 
 using an XML serialization that validates against http://purl.org/dc/terms/ 
 is probably sufficient (whether that's RDF or not doesn't matter at this 
 point).

 So, back to the other part of the question:  has anybody seen a MODS — 
 DCTERMS crosswalk in the wild?  It looks like there's a lot of similarity 
 between the two, but before I go too deep down that rabbit hole, I'd like to 
 make sure someone else hasn't already experienced that, erm, joy.

 MJ



Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread MJ Suhonos
 dcterms is so terribly lossy that it would be a shame to reduce MARC to it.

This is *precisely* the other half of my rationale — a shame?  Why?  If MARC is 
the mind prison that some purport it to be, then let's see what a system built 
devoid of MARC, but based on the best alternative we have looks like.

That may well *not* be DCTERMS, but I do like the DCAM model, and there are 
plenty of non-library systems out there that speak simple DC (OAI-PMH is one 
example from this thread alone).  Being conceptually RDF-compatible is just a 
bonus for me.

This would be an incentive for them to at least consider implementing DCTERMS, 
which may be terribly lossy compared to MARC, but is a huge increase in 
expressivity compared to simple DC.  Integrating MARC-based records and 
DC-based records from OAI sources in a single database could be a useful thing 
to play with.

 What we need, ASAP, is a triple form of MARC (and I know some folks have 
 experimented with this...) and a translate from MARC to the RDA elements that 
 have been registered in RDF. However, I hear that JSC is going to be adding 
 more detail to the RDA elements so that could mean changes coming down the 
 pike.  I am interested in working on MARC as triples, which I see as a 
 transformation format. I have a database of MARC elements that might be a 
 crude basis for this.

This seems like it's looking to accomplish different goals than I am, but 
obviously if there's a MARC-as-triples intermediary that's workable *today* 
then I'd be happy to use that instead.  But I wonder: how navigable is it by 
people who don't understand MARC?  How much loss is potentially involved?

 QDC basically represents the same things as dcterms, so you can
 probably just take the existing XSLT and hack on it until it
 represents something that looks more like dcterms than qdc.

Yeah, that might be easier than mapping from MODS, though I'll have to see how 
much I can look at a MARC-based XSLT before my brain melts.  Hopefully it 
wouldn't take *too* much work.

 That won't address the issue of breaking up the MARC into
 individual resources, however.  You mention that you are looking for
 the short hop to RDF, but this is just going to give you a big pile of
 literals for things like creator/contributor/subject, etc.  I'm not
 really sure what the win would be, there.

Well, a MARC-as-triples approach would suffer from the same problem just as 
much, at least initially.  I think the issue of converting literals into URIs 
is an important second step, but let's get the literals into a workable format 
first.

I should clarify that my ultimate goal isn't to find a magical easy way to RDF, 
but rather to try to realize a way for libraries to get their data into a 
format that others are able and willing to play with.  I'm betting on the 
notion that the majority of (presumably non-librarian) users would rather have 
incomplete data in a format that they can understand and manipulate, rather 
than have to learn MARC.  I certainly would, and I'm a librarian (though 
probably a poor one because I don't understand or highly value MARC).

Naive? Heretical? Probably.  But worth a shot, I think.

MJ


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread MJ Suhonos
 NB: When Karen Coyle, Eric Morgan, and Roy Tennant all reply to your thread 
 within half an hour of each other, you know you've hit the big time.  Time to 
 retire young I think.

That would be Eric *Lease* Morgan — oh my god, you're right!  I'm already 
losing data!  It *is* insidious!  I repent!

MJ


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Aaron Rubinstein

On 5/3/2010 1:55 PM, Karen Coyle wrote:


1. MARC the data format -- too rigid, needs to go away
2. MARC21 bib data -- very detailed, well over 1,000 different data
elements, some well-coded data (not all); unfortunately trapped in #1


For the sake of my own understanding, I would love an explanation of the 
distinction between #1 and #2...  Re: #2, how is bibliographic data 
encoded in MARC any different than bibliographic data encoded in some 
other format?  Without the encoding format, you just have a pile of 
strings, right?  I agree that we have lots of rich bibliographic data 
encoded in MARC and it is an exciting possibility to move it out of MARC 
into other, more flexible formats.  Why, then, do we need to migrate the 
'elements' of the encoding format as well?  Taking one look at MARCXML 
makes it clear that the structure of MARC is not well suited to 
contemporary, *interoperable*, data formats.


Is there something specific to MARC that is not potentially covered by 
MODS/DCTERMS/BIBO/??? that I'm missing?


Thanks,

Aaron


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 2:40 PM, MJ Suhonos m...@suhonos.ca wrote:

 Yes, even to me as a librarian but not a cataloguer, many (most?) of these
 elements seem like overkill.  I have no doubt there is an edge-case for
 having this fine level of descriptive detail, but I wonder:

 a) what proportion of records have this level of description
 b) what kind of (or how much) user access justifies the effort in creating
 and preserving it


On many levels, I agree. Or I wish I could.

If you look at a business model like Amazon, for example, it's easy to
imagine that their overriding goal is, "Make the easy-to-find stuff
ridiculously easy to find." The revenue they get from someone finding an
edge-case book is exactly the same as the revenue they get from someone
buying Harry Potter. The ROI is easy to think about.

But I work in an academic library. In a lot of ways, our *primary audience*
is some grad student 12 years from now who needs one trivial piece of crap
to make it all come together in her head. I know we have thousands of books
that have never been looked at, but computing the ROI on someone being able
to see them some day is difficult. Maybe it's zero. Maybe not. We just can't
tell.

Now, none of this is to say that MARC/AACR2 is necessarily the best (or even
a good) way to go about making these works findable. I'm just saying that
evaluating the edge cases in terms of user access is a complicated
business.

  -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Beacom, Matthew
Although I agree with Roy's suggestion that librarians not gloat about our 
metadata, the notion that the value of a data element can be elicited from the 
frequency of its use in the overall domain of library materials is misleading 
and contrary to the report Roy cites. 

The sub-section of the very useful and informative OCLC report that Roy cites 
is very good on this point. Section 2. MARC Tag Usage in WorldCat by Karen 
Smith-Yoshimura clearly lays out the data in the context of WorldCat and the 
cataloging practice of the OCLC members.  

Library holdings are dominated by texts and in terms of titles cataloged texts 
are dominated by books. This preponderance of books tilts the ratios of use per 
individual data elements. Many data elements pertain to either a specific form 
of material, manuscripts, for instance. Others pertain to specific content, 
musical notation, for instance. Some pertain to both, manuscript scores, for 
instance. Within the total aggregate of library materials, data elements that 
are specific per material or content do not rise in usage rates to anything 
near 20% of the aggregate total of titles. Yet these elements are necessary or 
valuable to those wishing to discover and use the materials, and when one 
recalls that 1% use rates in WorldCat equal about 1,000,000 titles the 
usefulness of many MARC data elements can be seen as widespread.

According to the report, 69 MARC tags occur in more than 1% of the records in 
WorldCat.  That is quite a few more than Roy's 11, but even accounting for 
Karen's data elements being equivalent to the number of MARC sub-fields this is 
far fewer than the 1,000 data elements available to a cataloger in MARC. 

Matthew Beacom


By the way, the descriptive fields used in more than 20% of the MARC records in 
WorldCat are:

245 Title statement 100%
260 Imprint statement 96%
300 Physical description 91%
100 Main entry - personal name 61%
650 Subject added entry - topical term 46%
500 General note 44%
700 Added entry - personal name 28%

They answer, more or less, a few basic questions a user might have about the 
material:
What is it called? Who made it? When was it made? How big is it? What is it 
about? Answers to the question, "How can I get it?" are usually given in the 
associated MARC holdings record. 
 

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of Roy 
Tennant
Sent: Monday, May 03, 2010 2:15 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MODS and DCTERMS

I would even argue with the statement very detailed, well over 1,000
different data elements, some well-coded data (not all). There are only 11
(yes, eleven) MARC fields that appear in 20% or more of MARC records
currently in WorldCat[1], and at least three of those elements are control
numbers or other elements that contribute nothing to actual description. I
would say overall that we would do well to not gloat about our metadata
until we've reviewed the facts on the ground. Luckily, now we can.
Roy

[1] http://www.oclc.org/research/publications/library/2010/2010-06.pdf

On Mon, May 3, 2010 at 11:03 AM, Eric Lease Morgan emor...@nd.edu wrote:

 On May 3, 2010, at 1:55 PM, Karen Coyle wrote:

  1. MARC the data format -- too rigid, needs to go away
  2. MARC21 bib data -- very detailed, well over 1,000 different data
  elements, some well-coded data (not all); unfortunately trapped in #1



 The differences between the two points enumerated above, IMHO, seem to be
 the at the heart of the never-ending debate between computer types and
 cataloger types when it comes to library metadata. The non-library computer
 types don't appreciate the value of human-aided systematic description. And
 the cataloger types don't understand why MARC is a really terrible bit
 bucket, especially considering the current environment. All too often the
 two camps don't know to what the other is speaking. MARC must die. Long
 live MARC.

 --
 Eric Lease Morgan



Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Arash.Joorabchi
The stats reported in this paper might help:

http://homes.ukoln.ac.uk/~kg249/publ/RenardusFinal.pdf

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: 03 May 2010 19:09
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] A call for your OPAC (or other system) statistics!
(Browse interfaces)

I got email from a person today saying, and I quote,

 I must say that [the lack of a browse interface] come as a shock
(*which
interface cannot browse??*)

[Emphasis mine]

Here, a browse interface is one where you can get a giant list of all
the
titles/authors/subjects whatever -- a view on the data devoid of any
searching.

Will those of you out there with browse interfaces in your system take
a
couple minutes to send along a guesstimate of what percentage of patron
sessions involve their use?

[Note that for right now, I'm excluding type-ahead search boxes
although
there's an obvious and, in my mind, strong argument to be made that
they're
substantially similar for many types of data]

We don't have a browse interface on our (VuFind) OPAC right now. But in
the interest of paying it forward, I can tell you that Mirlyn, our OPAC,
has numbers like this:

Pct of Mirlyn sessions, Feb/March/April 2010, which included at least one
basic search and also:

  Go to full record view         46%    (we put a lot of info in search results)
  Select/favorite an item        15%
  Add a facet                    13%
  Export record(s)
    to email/refworks/RIS/etc.    3.4%
  Send to phone (sms)             0.21%
  Click on faq/help/AskUs
    in footer                     0.17%  (324 total)

Based on 187,784 sessions, 2010.02.01 to 2010.04.31

So...anyone out there able to tell me anything about browse interfaces?

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Tod Olson
Bill, 

Here are relative percentages for our Horizon catalog, based on our 2008-2009 
annual report:

Browse Searches        76.2%
Keyword Searches       20.9%
Multi-index Searches    2.9%

That interface presents a browse search box before a keyword search box, so 
browses are encouraged by the UI.

That said, we did a study with our graduate students this year, and they rely on 
browse searches for some of their academic work.  One example is the use of subject and 
author browses, which lets the student feel confident that they have been 
exhaustive in searching their area of research.  This can possibly be 
accommodated in other ways.

In addition to known-item searching, our grad students also use title browse to 
be confident that we do _not_ own something.  In our relevance-ranked 
interface, sometimes the scholar may blame relevance ranking for hiding a 
title from them which we don't actually own.  It's an understandable reaction.

-Tod

Tod Olson t...@uchicago.edu
Systems Librarian
University of Chicago Library

On May 3, 2010, at 1:08 PM, Bill Dueber wrote:

 I got email from a person today saying, and I quote,
 
 I must say that [the lack of a browse interface] come as a shock (*which
 interface cannot browse??*)
 
 [Emphasis mine]
 
 Here, a browse interface is one where you can get a giant list of all the
 titles/authors/subjects whatever -- a view on the data devoid of any
 searching.
 
 Will those of you out there with browse interfaces in your system take a
 couple minutes to send along a guesstimate of what percentage of patron
 sessions involve their use?
 
 [Note that for right now, I'm excluding type-ahead search boxes although
 there's an obvious and, in my mind, strong argument to be made that they're
 substantially similar for many types of data]
 
 We don't have a browse interface on our (VuFind) OPAC right now. But in the
 interest of paying it forward, I can tell you that in Mirlyn, our OPAC, has
 numbers like this:
 
 Pct of Mirlyn sessions, Feb/March/April 2010, which included at least one
 basic
 search and also:
 
  Go to full record view  46% (we put a lot of info in search results)
  Select/favorite an item   15%
  Add a facet:13%
  Export record(s)
   to email/refworks/RIS/etc. 3.4%
  Send to phone (sms) 0.21%
  Click on faq/help/AskUs
 in footer0.17%  (324 total)
 
 Based on 187,784 sessions, 2010.02.01 to 2010.04.31
 
 So...anyone out there able to tell me anything about browse interfaces?
 
 -- 
 Bill Dueber
 Library Systems Programmer
 University of Michigan Library


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Karen Coyle

Quoting Beacom, Matthew matthew.bea...@yale.edu:



According to the report, 69 MARC tags occur in more than 1% of the   
records in WorldCat.  That is quite a few more than the Roy's 11,   
but even accounting for Karen's data elements being equivalent to   
the number of MARC sub-fields this is far fewer than the 1,000 data   
elements available to a cataloger in MARC.


So much depends on how you count things, so at the  
http://kcoyle.net/rda/ site I have put two MARC-related files. The  
first is just a list of elements (variable subfields) in alpha order  
with duplicates removed. Yes, I realize how imperfect this is, and  
that we will need to look beyond names to *meaning* of elements to  
determine what we really have. This file does not include indicators,  
and sometimes indicators really do create a separate element, like  
when person name becomes Family based on its indicator.


That file has over 560 entries.

The next file probably needs some more thought, but it is a list of  
the variable field indicators and subfields, leaving in subfields that  
are duplicated in different fields. I removed some of the numeric  
subfields that didn't seem to result in an actual element (2, 3, 5,  
6, 8), but could be wrong about that. I also did not include  
indicators that are = Undefined. We can debate whether a personal  
name in an added entry is the same element as a personal name in a  
subject heading, and similarly for the various places where geographic  
names are used, titles, etc etc etc. This is the analysis that is  
needed to reduce MARC21 to a cleaner set of data elements.


That file has 1421 entries.

Neither of these contains any of the fixed field elements (many of  
which, IMO, should replace textual elements now carried in MARC21).  
When I looked at the fixed fields (and this is reported at  
http://futurelib.pbworks.com/Data+and+Studies), I came up with this  
count of *unique* fixed field elements (each with multiple values):


008 - 58
007 - 55

Each one of these should become a controlled value list in a SemWeb  
implementation of MARC. RDA appears to have a total of 68 defined  
value lists, but I don't believe that those include ones defined  
elsewhere, such as languages, country codes, etc.


kc

p.s. linked from that same page is the file I am using for this  
analysis, in CSV format, if anyone else wants to play with it. I have  
tried to keep it up to date with MARBI proposals.




Matthew Beacom


By the way, the descriptive fields used in more than 20% of the MARC  
 records in WorldCat are:


245 Title statement 100%
260 Imprint statement 96%
300 Physical description 91%
100 Main entry - personal name 61%
650 Subject added entry - topical term 46%
500 General note 44%
700 Added entry - personal name 28%

They answer, more or less, a few basic questions a user might have   
about the material:
What is it called? Who made it? When was it made? How big is it?   
What is it about? Answers to the question, How can I get it? are   
usually given in the associated MARC holdings record.



-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf  
 Of Roy Tennant

Sent: Monday, May 03, 2010 2:15 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MODS and DCTERMS

I would even argue with the statement very detailed, well over 1,000
different data elements, some well-coded data (not all). There are only 11
(yes, eleven) MARC fields that appear in 20% or more of MARC records
currently in WorldCat[1], and at least three of those elements are control
numbers or other elements that contribute nothing to actual description. I
would say overall that we would do well to not gloat about our metadata
until we've reviewed the facts on the ground. Luckily, now we can.
Roy

[1] http://www.oclc.org/research/publications/library/2010/2010-06.pdf

On Mon, May 3, 2010 at 11:03 AM, Eric Lease Morgan emor...@nd.edu wrote:


On May 3, 2010, at 1:55 PM, Karen Coyle wrote:

 1. MARC the data format -- too rigid, needs to go away
 2. MARC21 bib data -- very detailed, well over 1,000 different data
 elements, some well-coded data (not all); unfortunately trapped in #1



The differences between the two points enumerated above, IMHO, seem to be
the at the heart of the never-ending debate between computer types and
cataloger types when it comes to library metadata. The non-library computer
types don't appreciate the value of human-aided systematic description. And
the cataloger types don't understand why MARC is a really terrible bit
bucket, especially considering the current environment. All too often the
two camps don't know to what the other is speaking. MARC must die. Long
live MARC.

--
Eric Lease Morgan







--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234  
skype: kcoylenet


Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-05-03 Thread Karen Coyle

Quoting Jakob Voss jakob.v...@gbv.de:


I bet there are several reasons why OpenURL failed in some way but I
think one reason is that SFX got sold to Ex Libris. Afterwards there
was no interest from Ex Libris in getting a simple clean standard, and most
libraries ended up buying a black box with an OpenURL label on it -
instead of developing their own systems based on a common standard. I
bet you can track most bad library standards to commercial vendors. I
don't trust any standard without open specification and a reusable Open
Source reference implementation.


For what it's worth, that does not coincide with my experience.

kc
--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] SRU/ZeeRex explain question : record schemas

2010-05-03 Thread Ray Denenberg, Library of Congress

From: Jonathan Rochkind rochk...@jhu.edu

But you will leave sorting as part of CQL too in any changes to CQL specs, 
I hope?  I think CQL has a lot of use even outside of SRU proper, so I 
encourage you to leave its spec not too tightly coupled to SRU.


The OASIS TC firmly supports this approach (and by firmly I mean 100%) so 
the only way this could get changed is via public comment.





I think there are at least three ways to sort as part of (different 
versions of?) SRU now!

1) An actual separate sortKeys query parameter
2) Included in the CQL expression in query, using the sortBy keyword.
3) In draft not finalized, OASIS/SRU 2.0 methods of specifying XPaths for 
sorting.  [Thanks for including the link to the current SRU 2.0 draft, I 
didn't know that was publicly available anywhere; it's not really 
googlable].


As you corrected yourself in a subsequent message:

Ah, I think I was wrong below. I must have been looking at different 
versions of the SRU spec without realizing it.


SRU 1.1 includes a sortKeys parameter, and CQL 1.1 does not include a 
sortBy clause.


SRU 1.2 does NOT include a sortKeys parameter, and CQL 1.2 does include 
a sortBy clause.


Yes, that's correct.


Do I have this right?  As SRU 1.2 is the only actual spec I have to work 
with... am I right that either top-level sortKeys, or embedded in CQL 
with sortBy would both be legal in SRU 1.2

No. Legal in 2.0 (the OASIS version), not legal in 1.2.  In 1.2 it is not 
legal to have a sort parameter in the request.


OASIS is standardizing SRU and CQL loosely coupled; that is, SRU can use 
other query languages and CQL may be invoked by other protocols, but they 
are generally oriented towards being used together.   But since SRU may be 
used with a query language that might not have sort capability, the TC felt 
it necessary to include sorting as part of the protocol. Conversely, since 
CQL may be used by a protocol that doesn't support sorting, CQL should 
similarly support sorting. There is a section in the draft standard that 
discusses what to do if a request has conflicting sort specifications.


--Ray


Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Roy Tennant
Thanks, Matthew, for a much more nuanced and accurate depiction of the data.
I would encourage anyone interested in this topic to spend some time with
this report, which was one result of a great deal of work by many people in
research institutions around the world. The findings and recommendations are
well worth your time.
Roy

On Mon, May 3, 2010 at 11:55 AM, Beacom, Matthew matthew.bea...@yale.eduwrote:

 Although I agree with Roy's suggestion that librarians not gloat about our
 metadata, the notion that the value of a data element can be elicited from
 the frequency of its use in the overall domain of library materials is
 misleading and contrary to the report Roy cites.

 The sub-section of the very useful and informative OCLC report that Roy
 cites is very good on this point. Section 2. MARC Tag Usage in WorldCat by
 Karen Smith-Yoshimura clearly lays out the data in the context of WorldCat
 and the cataloging practice of the OCLC members.

 Library holdings are dominated by texts and in terms of titles cataloged
 texts are dominated by books. This preponderance of books tilts the ratios
 of use per individual data elements. Many data elements pertain to either a
 specific form of material, manuscripts, for instance. Others pertain to
 specific content, musical notation, for instance. Some pertain to both,
 manuscript scores, for instance. Within the total aggregate of library
 materials, data elements that are specific per material or content do not
 rise in usage rates to anything near 20% of the aggregate total of titles.
 Yet these elements are necessary or valuable to those wishing to discover
 and use the materials, and when one recalls that 1% use rates in WorldCat
 equal about 1,000,000 titles the usefulness of many MARC data elements can
 be seen as widespread.

 According to the report, 69 MARC tags occur in more than 1% of the records
 in WorldCat.  That is quite a few more than Roy's 11, but even
 accounting for Karen's data elements being equivalent to the number of MARC
 sub-fields this is far fewer than the 1,000 data elements available to a
 cataloger in MARC.

 Matthew Beacom


 By the way, the descriptive fields used in more than 20% of the MARC
 records in WorldCat are:

 245 Title statement 100%
 260 Imprint statement 96%
 300 Physical description 91%
 100 Main entry - personal name 61%
 650 Subject added entry - topical term 46%
 500 General note 44%
 700 Added entry - personal name 28%

 They answer, more or less, a few basic questions a user might have about
 the material:
 What is it called? Who made it? When was it made? How big is it? What is it
 about? Answers to the question, How can I get it? are usually given in the
 associated MARC holdings record.


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Roy Tennant
 Sent: Monday, May 03, 2010 2:15 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] MODS and DCTERMS

 I would even argue with the statement "very detailed, well over 1,000
 different data elements, some well-coded data (not all)". There are only 11
 (yes, eleven) MARC fields that appear in 20% or more of MARC records
 currently in WorldCat[1], and at least three of those elements are control
 numbers or other elements that contribute nothing to actual description. I
 would say overall that we would do well to not gloat about our metadata
 until we've reviewed the facts on the ground. Luckily, now we can.
 Roy

 [1] http://www.oclc.org/research/publications/library/2010/2010-06.pdf

 On Mon, May 3, 2010 at 11:03 AM, Eric Lease Morgan emor...@nd.edu wrote:

  On May 3, 2010, at 1:55 PM, Karen Coyle wrote:
 
   1. MARC the data format -- too rigid, needs to go away
   2. MARC21 bib data -- very detailed, well over 1,000 different data
   elements, some well-coded data (not all); unfortunately trapped in #1
 
 
 
  The differences between the two points enumerated above, IMHO, seem to be
  at the heart of the never-ending debate between computer types and
  cataloger types when it comes to library metadata. The non-library computer
  types don't appreciate the value of human-aided systematic description. And
  the cataloger types don't understand why MARC is a really terrible bit
  bucket, especially considering the current environment. All too often the
  two camps don't know what the other is talking about. MARC must die.
  Long live MARC.
 
  --
  Eric Lease Morgan
 



Re: [CODE4LIB] MODS and DCTERMS

2010-05-03 Thread Eric Lease Morgan
On May 3, 2010, at 2:47 PM, Aaron Rubinstein wrote:

 1. MARC the data format -- too rigid, needs to go away
 2. MARC21 bib data -- very detailed, well over 1,000 different data
 elements, some well-coded data (not all); unfortunately trapped in #1
 
 For the sake of my own understanding, I would love an explanation of the 
 distinction between #1 and #2...


Item #1

The first item (#1) is MARC, the data structure -- a container for holding 
various types of bibliographic information. From one of my older publications 
[1]:

  ...the MARC record is a highly structured piece of information.
  It is like a sentence with a subject, predicate, objects,
  separated with commas, semicolons, and one period. In data
  structure language, the MARC record is a hybrid sequential/random
  access record.

  The MARC record is made up of three parts: the leader, the
  directory, the bibliographic data. The leader (or subject in our
  analogy) is always represented by the first 24 characters of each
  record. The numbers and letters within the leader describe the
  record's characteristics. For example, the length of the record
  is in positions 1 to 5. The type of material the record
  represents (authority, bibliographic, holdings, et cetera) is
  signified by the character at position 7. More importantly, the
  characters from positions 13 to 17 represent the base. The base
  is a number pointing to the position in the record where the
  bibliographic information begins.
  
  The directory is the second part of a MARC record. (It is the
  predicate in our analogy.) The directory describes the record's
  bibliographic information with directory entries. Each entry
  lists the types of bibliographic information (items called
  tags), how long the bibliographic information is, and where the
  information is stored in relation to the base. The end of the
  directory and all variable length fields are marked with a
  special character, the ASCII character 30.
  
  The last part of a MARC record is the bibliographic information.
  (It is the object in our sentence analogy.) It is simply all the
  information (and more) on a catalog card. Each part of the
  bibliographic information is separated from the rest with the
  ASCII character 30. Within most of the bibliographic fields are
  indicators and subfields describing in more detail the fields
  themselves. The subfields are delimited from the rest of the
  field with the ASCII character 31.
  
  The end of a MARC record is punctuated with an end-of-record
  mark, ASCII character 29. The ASCII characters 31, 30, and 29
  represent our commas, semicolons, and periods, respectively.

At the time, MARC -- the data structure -- was really cool. Consider the 
environment in 1965. No hard disks. Tape drives instead. Data storage was 
expensive. The medium had to be read from beginning to end. No (or rarely any) 
random data access. Thus, the record and field lengths were relatively 
short. (No MARC record can be longer than 99,999 characters, and no MARC field can 
be longer than 9,999 characters.) Remember too the purpose of MARC -- to transmit 
the content of catalog cards. Given the leader, the directory, and the 
bibliographic sections of a MARC record all preceded by pseudo checksums and 
delimited by non-printable ASCII characters, the MARC record -- the data 
structure -- comes with a plethora of checks and balances. Very nice.
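
For the curious, here is a minimal sketch in Python of walking that structure. 
It assumes a single well-formed record already read into a byte string and 
ignores character-set issues entirely:

  FIELD_TERMINATOR = b"\x1e"  # ASCII 30; 31 delimits subfields, 29 ends the record

  def parse_marc(raw):
      leader = raw[0:24].decode("ascii")
      record_length = int(leader[0:5])    # "positions 1 to 5" above
      base_address  = int(leader[12:17])  # "positions 13 to 17": where data starts

      # The directory sits between the leader and the base address;
      # each 12-character entry is tag (3), field length (4), offset (5).
      directory = raw[24:base_address - 1]
      fields = []
      for i in range(0, len(directory), 12):
          entry  = directory[i:i + 12].decode("ascii")
          tag    = entry[0:3]
          length = int(entry[3:7])
          start  = int(entry[7:12])
          data   = raw[base_address + start:base_address + start + length]
          fields.append((tag, data.rstrip(FIELD_TERMINATOR)))
      return record_length, fields

(In real life you would reach for an existing library such as pymarc rather 
than roll your own; the point is simply how little machinery the format needs.)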

Fast forward to the present day. Disk space is cheap. Tapes are not the norm. 
More importantly, the wider computing environment uses XML as its data 
structure of choice. If libraries are about sharing information, then we need 
to communicate in that environment's language. The language of the Net is XML, 
not MARC. Not only is MARC -- the data structure -- stuck on 50-year-old 
technology, but more importantly it is not the language of the people with whom 
we want to share.


Item #2

Our bibliographic data (item #2) is the metadata of the Web. While it is 
important, and it adds a great deal of value, it is not as important as it used 
to be. It too needs to change. Remember, MARC was originally designed to print 
catalog cards. Author. Title. Pagination. Series. Notes. Subject headings. 
Added entries. Looking back, these were relatively simple data elements, but 
what about system numbers? ISBN numbers? Holdings information? Tables of 
contents? Abstracts? Ratings? We have stuffed these things into MARC every 
which way and we call MARC flexible.

More importantly, and as many have said previously, string values in MARC 
records lead to maintenance nightmares. Instead, like a relational database 
model, values need to be described using keys -- pointers -- to the canonical 
values. This makes find/replace operations painless, enables the use of 
different languages, and brings numerous other advantages.
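
A toy illustration of the difference (the identifier and the variant form of 
the name are invented for the example):

  # String value: the heading is baked into the record itself.
  record_with_string = {
      "author": "Kilgour, Frederick Gridley (1914-2006)",
  }

  # Keyed value: the record points at a canonical authority entry, so a
  # correction or an added variant form happens in exactly one place.
  authorities = {
      "auth:12345": {                      # invented identifier
          "preferred": "Kilgour, Frederick Gridley (1914-2006)",
          "variants": ["Kilgour, Fred G."],
      },
  }
  record_with_key = {
      "author": "auth:12345",
  }

  print(authorities[record_with_key["author"]]["preferred"])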

ISBD is also a pain. Take the following string:

  Kilgour, Frederick Gridley (1914–2006)

There is way too much punctuation going on here. Yes, 

Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 7:10 PM, Bryan Baldus
bryan.bal...@quality-books.com wrote:
 I can't speak for other users (particularly the generic patron user type), 
 but as a
cataloger/librarian user,

...and THERE IT IS, ladies and gentlemen.

I've started trying to keep a list of IP addresses I *know* are staff
and separate out the statistics. The OPAC isn't for the librarians;
the ILS client is. If the client sucks so badly that librarians need
the OPAC to do our job (as I was told several times during our roll
out of vufind), then the solution is to fix the client, or
(alternately) build up a workaround for staff. NOT to overload the
OPAC.  If librarians need specialized tools, let's just build them
without some sort of pretense that they're anything but the tiniest
blip on the bell curve of patrons.

And, BTW, just because you (and you know who you are!) do 8 hours of
reference desk work a week doesn't mean you have a hell of a lot more
insight. The patrons that self-select to actually speak to a librarian
sitting *in the library* are a freakshow themselves, statistically
speaking.

[Not meaning to imply that Bryan doesn't know the difference between
himself and a normal patron; his post makes it clear that he does. I
just took the opportunity to rant.]

I'm not saying that patrons don't use browse much (that's what I'm
trying to determine). But, to borrow from the 2009 code4lib
conference, every time a librarian's work habits inform the design of
a public-facing application, God kills a kitten.

  -Bill-

-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 6:34 PM, Karen Coyle li...@kcoyle.net wrote:
 Quoting Jakob Voss jakob.v...@gbv.de:

 I bet there are several reasons why OpenURL failed in some way, but I
 think one reason is that SFX got sold to Ex Libris. Afterwards there
 was no interest from Ex Libris in a simple, clean standard, and most
 libraries ended up buying a black box with an OpenURL label on it -
 instead of developing their own systems based on a common standard. I
 bet you can track most bad library standards to commercial vendors. I
 don't trust any standard without an open specification and a reusable Open
 Source reference implementation.

 For what it's worth, that does not coincide with my experience.


I'm going to turn this back on Karen and say that much of my pain does
come from vendors, but it comes from their shitty data. OpenURL and
resolvers would be a much more valuable piece of technology if the
vendors would/could get off their collective asses(1) and give us
better data.

 -Bill-

(1) By this, of course, I mean if the librarians would grow a pair
and demand better data via our contracts


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Jonathan Rochkind
When it's actually a reference librarian using it for reference/research tasks, 
I think it can be a legitimate use case -- so long as you remember that it is 
representative of only a certain type of expert searcher (not necessarily 
even every searcher requiring sophisticated or complex features, just a certain 
type with certain tasks), which represents a minority of searchers, and don't 
over-emphasize its importance beyond its actual representativeness -- don't 
sacrifice the needs of the majority of users for a minority. 

When the tasks are related to cataloging and assigning headings -- I absolutely 
and completely agree with Bill: this is not an appropriate use case for a 
public interface.

So, Bill, you're still not certain yourself exactly what purposes browse is 
used for by actual non-librarian searchers, if anything?

Jonathan

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Bill Dueber 
[b...@dueber.com]
Sent: Monday, May 03, 2010 8:28 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] A call for your OPAC (or other system) statistics! 
(Browse interfaces)



Re: [CODE4LIB] A call for your OPAC (or other system) statistics! (Browse interfaces)

2010-05-03 Thread Bill Dueber
On Mon, May 3, 2010 at 8:39 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 So, Bill, you're still not certain yourself exactly what purposes browse is 
 used for by actual non-librarian searchers, if anything?

Right. I'm not sure *the extent* to which it's used (data which are
necessarily going to be messy and partially driven by how prevalent
browse vs search are in the interface), and I certainly don't know
what's going through people's heads when they choose to use it (on
those occasions when they make a conscious choice to use browse in
addition to/instead of  search).

My attempts to find stuff in the research literature failed me; if
anyone has other pointers, I'd love to read them! (If only there was a
real librarian around to help poor little me...)

 -Bill-


Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-05-03 Thread stuart yeates

Bill Dueber wrote:


if the librarians would grow a pair
and demand better data via our contracts


While I agree with your overall point, it would have been better made 
without the gendered phrasing, in my view.


cheers
stuart
--
Stuart Yeates
http://www.nzetc.org/   New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository


Re: [CODE4LIB] OpenURL and DAIA

2010-05-03 Thread Markus Fischer
We are just starting to use DAIA for a small register of journal holdings, 
in connection with VuFind and the new DAIA driver in VuFind.


Since the holdings register is not a big union catalog, but rather a 
simple database in which you simply mark which journal (ISSN) you hold 
for which period, we send the requests by OpenURL, do some 
ISSN mapping, and send back DAIA responses.
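
In case it helps anyone, here is a very stripped-down sketch of that flow. The 
holdings table, ISSN, and field names are invented, and the returned dict only 
mimics DAIA's document/item/available idea rather than the real schema:

  from urllib.parse import parse_qs

  # Invented holdings table: ISSN -> coverage we hold.
  HOLDINGS = {
      "1234-5678": {"from": 1990, "to": 2005},
  }

  def availability(openurl_query):
      """Map an OpenURL (KEV) request to a DAIA-flavoured availability answer."""
      params = parse_qs(openurl_query)
      issn = params.get("rft.issn", [""])[0]
      year = int(params.get("rft.date", ["0"])[0] or 0)
      held = HOLDINGS.get(issn)
      available = bool(held and held["from"] <= year <= held["to"])
      return {"document": {"id": "issn:" + issn,
                           "item": [{"available": available}]}}

  print(availability("rft.issn=1234-5678&rft.date=1998"))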


We will use that in connection with an open and cooperative reference 
database for nursing literature.


DAIA works very well for us.

There should perhaps be an official way to request subsets of holdings 
and to transport some information, e.g. about ILL fees, in DAIA (probably 
the latter can be done in the limitation tag?).


But we work around that by combining it with IP-based requests. So we can 
do crazy stuff like showing institution-specific availability in the 
search results overview, and general availability in the details of 
a record.


I think with DAIA Jakob has created a simple and lightweight solution to a 
real problem in the library world.


Markus

Jakob Voss wrote:

Owen wrote:

Although part of the problem is that you might want to offer any service on 
the basis of an OpenURL, the major use case is supply of a document (either 
online or via ILL) - so it strikes me you could look at DAIA 
http://www.gbv.de/wikis/cls/DAIA_-_Document_Availability_Information_API ?

Jakob, does this make sense?


Just having read Joel Spolsky's article about Architecture Astronauts 
that Mike pointed to [1], I hesitate to propagate all the things you can do 
with DAIA. But your use case makes sense if you want to offer services 
provided or mediated by a specific institution (such as a library) for 
a specific publication.


Inspired by your idea to combine OpenURL and DAIA, I updated the DAIA Perl 
library [2] and hacked together a DAIA server that also understands some very 
limited OpenURL (it only knows books with an ISBN):


You can look up which library in the GBV library union has a specific 
publication, by its identifier:


http://ws.gbv.de/daia/gvk/?id=gvk:ppn:48574418X

or by OpenURL

http://ws.gbv.de/daia/gvk/?ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.isbn=0-471-38393-7 
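
(If you are calling this from a script, letting the standard library do the 
URL encoding saves you from mangled ampersands; a minimal example against the 
same endpoint:)

  from urllib.parse import urlencode

  params = {
      "ctx_ver": "Z39.88-2004",
      "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
      "rft.isbn": "0-471-38393-7",
  }
  url = "http://ws.gbv.de/daia/gvk/?" + urlencode(params)
  print(url)   # fetch it with urllib.request.urlopen(url) to get the DAIA response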



Have a look at the simple source code of this script at

http://daia.svn.sourceforge.net/viewvc/daia/trunk/daiapm/examples/gvk.pl?view=markup 



I want to stress that this demo DAIA server does not use the full 
expressive power of DAIA; in fact it does not provide any availability 
information at all - but hopefully you get the concept.


Cheers
Jakob

[1] http://www.joelonsoftware.com/articles/fog18.html
[2] https://sourceforge.net/projects/daia/files/DAIA-0.27.tar.gz