Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Tue, Mar 1, 2011 at 11:14 PM, Roy Tennant roytenn...@gmail.com wrote:
 On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back god...@gmail.com wrote:

Similarly, the date associated with a record can come in a variety of
formats. Some are single-field (20080901), some are abbreviated
(200811), some are separated into year, month, date, etc.  Some
records have a mixture of those.

 In this world of MARC (s/MARC/hurt) I call that an embarrassment of
 riches. I've spent some bit of time parsing MARC, especially lately,
 and just the fact that Summon provides a normalized date element is
 HUGE.

That's great to hear - but how do I know which elements to use?

For instance, look at the JSON excerpt at
http://api.summon.serialssolutions.com/help/api/search/response/documents

"PublicationDateCentury":[
  "1900"
],
"PublicationDateDecade":[
  "1970"
],
"PublicationDateYear":[
  "1979"
],
"PublicationDate":[
  "1979."
],
"PublicationDate_xml":[
  {
    "day":"01",
    "month":"01",
    "text":"1979.",
    "year":"1979"
  }
],

Which one is the cleaned up date, and in which order shall I be
looking for the date field in the record when some or all of this
information is missing in a particular record?

Andrew responded that PublicationDate_xml, if given, is the
preferred one - but this raises the question of which field in
PublicationDate_xml to use: .text, .day, or .year?  What if some are
missing?
If PublicationDate_xml is missing, do I then look for
PublicationDate?  Or is PublicationDateYear/Month/Decade preferred to
PublicationDate?  Which fields are derived from which others?

These are the types of questions I'm looking to answer.

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Roy Tennant
Godmar,
I'm surprised you're asking this. Most of the questions you want
answered could be answered by a basic programming construct: an
if-then-else statement and a simple decision about what you want to
use in your specific application (for example, do you prefer text
with the period, or not?). About the only question that such a
solution wouldn't deal with is which fields are derived from which
others, which strikes me as superfluous to your application if you
know a hierarchy of preference. But perhaps I'm missing something
here.
Roy



Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Wed, Mar 2, 2011 at 11:12 AM, Roy Tennant roytenn...@gmail.com wrote:
 Godmar,
 I'm surprised you're asking this. Most of the questions you want
 answered could be answered by a basic programming construct: an
 if-then-else statement and a simple decision about what you want to
 use in your specific application (for example, do you prefer text
 with the period, or not?). About the only question that such a
 solution wouldn't deal with is which fields are derived from which
 others, which strikes me as superfluous to your application if you
 know a hierarchy of preference. But perhaps I'm missing something
 here.

I'm not asking how to code it, I'm asking for the algorithm I should
use, given the fact that I'm not familiar with the provenance and
status of the data Summon returns (which, I understand, is a mixture
of original, harvested data, and cleaned-up, processed data.)

Can you suggest such an algorithm, given the fact that each of the 8
elements I showed in the example (PublicationDateYear,
PublicationDateDecade, PublicationDate, PublicationDateCentury,
PublicationDate_xml.text, PublicationDate_xml.day,
PublicationDate_xml.month, PublicationDate_xml.year) is optional?  But
wait - I think I've also seen records where there is a
PublicationDateMonth, and records where some values have arrays of
length > 1.

Can you suggest, or at least outline, such an algorithm?

It would be helpful to know, for instance, if the presence of a
PublicationDate_xml field supplants any other PublicationDate* fields
(does it?)  If a PublicationDate_xml field is absent, which field
would I want to look at next?  Is PublicationDate more reliable than a
combination of PublicationDateYear and PublicationDateMonth (and
perhaps PublicationDateDay if it exists?)?

If the PublicationDate_xml is present, then: should I prefer the .text
option?  What's the significance of that dot? Is it spurious, like the
identifier you mentioned you find in raw MARC records?  If not, what,
if anything, is known about the presence of the other fields?  What if
multiple fields are given in an array?  Is the ordering significant
(e.g., the first one is more trustworthy?) Or should I sort them based
on a heuristic?  (e.g., if "20100523" and "201005" are given, prefer
the former?)  What if the data is contradictory?
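
One possible shape for the "prefer the more specific value" heuristic
is sketched below in Python. This is only an illustration under stated
assumptions, not something Summon documents: it assumes the values are
compact digit strings (YYYY, YYYYMM or YYYYMMDD), and the function names
parse_compact_date and most_specific are made up for this example.
Values that do not fit (such as "1979." with its trailing dot) are
skipped, and contradictory values are not detected at all.

```python
import re


def parse_compact_date(value):
    """Parse 'YYYY', 'YYYYMM' or 'YYYYMMDD' into (year, month, day).

    Missing parts are returned as None; anything else yields None.
    """
    match = re.fullmatch(r"(\d{4})(\d{2})?(\d{2})?", value.strip())
    if match is None:
        return None
    year, month, day = (int(p) if p else None for p in match.groups())
    # Reject obviously impossible months and days.
    if month is not None and not 1 <= month <= 12:
        return None
    if day is not None and not 1 <= day <= 31:
        return None
    return (year, month, day)


def most_specific(values):
    """Return the parsed date with the most components; ties keep the first."""
    best, best_score = None, 0
    for value in values:
        parsed = parse_compact_date(value)
        if parsed is None:
            continue
        score = sum(part is not None for part in parsed)
        if score > best_score:
            best, best_score = parsed, score
    return best
```

With this sketch, most_specific(["201005", "20100523"]) picks the
second value because it carries a day as well as a month.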

These are the questions I'm seeking answers to; I know that those of
you who have coded their own Summon front-ends must have faced the
same questions when implementing their record displays.

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Wed, Mar 2, 2011 at 11:36 AM, Walker, David dwal...@calstate.edu wrote:
 Just out of curiosity, is there a Summon (API) developer listserv?  Should 
 there be?

Yes, there is - I'm waiting for my subscription there to be approved.

Like I said at the beginning of this thread, this is only tangentially
a Code4Lib issue, and certainly the details aren't.  But perhaps the
general problem is (?)

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Walker, David
Just out of curiosity, is there a Summon (API) developer listserv?  Should 
there be?

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Demian Katz
 These are the questions I'm seeking answers to; I know that those of
 you who have coded their own Summon front-ends must have faced the
 same questions when implementing their record displays.

Feel free to refer to VuFind's Summon template for reference if that is helpful:

https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/web/interface/themes/default/Summon/record.tpl

Andrew wrote this originally, and I've tweaked it in a few places to address 
problems as they arose.  I don't claim that this offers the definitive answer 
to your questions...  but it's working reasonably well for us so far.

- Demian


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Ed Summers
On Wed, Mar 2, 2011 at 11:38 AM, Godmar Back god...@gmail.com wrote:
 Like I said at the beginning of this thread, this is only tangentially
 a Code4Lib issue, and certainly the details aren't.  But perhaps the
 general problem is (?)

More than anything this seems like a documentation issue. From my seat
in the peanut gallery it seems like Godmar should be able to answer
these sorts of questions by looking at the Summon Search API
Documentation [1] for responses (which is quite nice btw).

Oh, and I think it's great to see this thread on code4lib, where other
people have been known to create an API or three. So thanks Godmar,
for asking here...

//Ed

[1] http://api.summon.serialssolutions.com/help/api/search/response


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Eric Lease Morgan
On Mar 2, 2011, at 12:22 PM, Ed Summers wrote:

 Oh, and I think it's great to see this thread on code4lib, where other
 people have been known to create an API or three. So thanks Godmar,
 for asking here...



I concur. I hope others more or less feel comfortable discussing 
product-specific issues on Code4Lib. Such discussions have more things in 
common than differences.

-- 
Eric Morgan
University of Notre Dame


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread LeVan,Ralph
Yes, the draft version of SRU 2.0 does include support for facets.  The 
functionality is based on the SOLR documentation of facets with perhaps some 
slight simplification.  None of the editors of the standard are active facet 
users, so comments on that feature in our draft would be appreciated. (I'm 
afraid I'm responsible for that work.  Personally, I found the SOLR 
functionality massively over-engineered and hope someone will recommend 
simplification.)

All the current draft documentation is available at 
http://www.loc.gov/standards/sru/oasis/.

Ralph

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Karen Coombs
 Sent: Wednesday, March 02, 2011 12:45 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] dealing with Summon
 
 I believe that there has been discussion of adding facets to SRU
 responses in the past. It may even be part of the standard now; I'm not
 sure.
 
 Facets in an SRU and/or Atom response would certainly be of interest
 to OCLC. Another area where it might be nice to consider collaborating
 is on a format for these records that is non-library developer
 friendly but rich enough to provide the appropriate metadata. If
 you've used WorldCat Search API you'll know that as a developer you're
 caught between the complexity of MARC and the simplicity (but lack of
 richness) of Dublin Core/Atom/RSS.
 
 Is there a middle ground metadata format that developers would prefer
 to see output?
 
 Karen
 


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Godmar Back
On Wed, Mar 2, 2011 at 11:54 AM, Demian Katz demian.k...@villanova.edu wrote:
 These are the questions I'm seeking answers to; I know that those of
 you who have coded their own Summon front-ends must have faced the
 same questions when implementing their record displays.

 Feel free to refer to VuFind's Summon template for reference if that is 
 helpful:

 https://vufind.svn.sourceforge.net/svnroot/vufind/trunk/web/interface/themes/default/Summon/record.tpl

 Andrew wrote this originally, and I've tweaked it in a few places to address 
 problems as they arose.  I don't claim that this offers the definitive answer 
 to your questions...  but it's working reasonably well for us so far.


Ah, thanks.  As they say, a piece of code speaks a thousand words!

So, to solve the conundrum: only PublicationDate_xml and
PublicationDate are of interest. If the former is given, use it and
print (if available) its .month, .day, and .year fields. Else, if the
latter is given, just print it.
Ignore all other date-related fields. Ignore PublicationDate_xml.text.
 Ignore if there's more than one date field - use the first one.
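
The rule above can be sketched in Python roughly as follows. This is a
minimal illustration, not VuFind's actual template code: the record is
assumed to be the parsed JSON object for one document, the helper names
first and display_date are made up here, and the "/" separator is an
arbitrary display choice.

```python
def first(value):
    """Return the first element of a list-valued field, or None if empty."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def display_date(record):
    """Pick a display date: PublicationDate_xml first, then PublicationDate."""
    xml = first(record.get("PublicationDate_xml"))
    if isinstance(xml, dict):
        # Use month, day and year if present; .text is deliberately ignored.
        parts = [xml.get(key) for key in ("month", "day", "year")]
        parts = [str(p) for p in parts if p]
        if parts:
            return "/".join(parts)
    raw = first(record.get("PublicationDate"))
    return str(raw) if raw else None
```

For the example record quoted earlier in the thread this yields
"01/01/1979"; a record with only PublicationDate gets that raw value
back unchanged, trailing dot included.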

This knowledge will also help me avoid sending unnecessary data to the
LibX client. As you know, Summon requires a proxy that talks to the
actual service, and cutting out redundant and derived fields at the
proxy could save a fair amount of bandwidth (though I'll have to check
if it also shaves off latency.) A typical search response (raw JSON,
with 20 hits) is > 500KB long, so investing computing time at the
proxy in cutting this down may be promising.
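
A rough Python sketch of such trimming at the proxy follows. The
assumptions are mine: that the response is a JSON object with a
top-level "documents" list, that the derived PublicationDate* fields
named in this thread are the ones to drop, and the function names
trim_document and trim_response are invented for the example. Whether
this saves enough bytes to matter would still need measuring.

```python
import json

# Date fields the display rule never looks at (names taken from this thread).
DROP_FIELDS = {
    "PublicationDateCentury",
    "PublicationDateDecade",
    "PublicationDateYear",
    "PublicationDateMonth",
}


def trim_document(doc):
    """Return a copy of one document without the unused date fields."""
    slim = {key: value for key, value in doc.items() if key not in DROP_FIELDS}
    entries = slim.get("PublicationDate_xml")
    if isinstance(entries, list):
        # Keep only the first entry and drop its unused .text member.
        slim["PublicationDate_xml"] = [
            {k: v for k, v in entry.items() if k != "text"}
            if isinstance(entry, dict) else entry
            for entry in entries[:1]
        ]
    return slim


def trim_response(raw_json):
    """Trim every document in a raw JSON response and re-serialize compactly."""
    response = json.loads(raw_json)
    documents = response.get("documents", [])  # assumed top-level key
    response["documents"] = [trim_document(doc) for doc in documents]
    return json.dumps(response, separators=(",", ":"))
```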

 - Godmar


Re: [CODE4LIB] dealing with Summon

2011-03-02 Thread Demian Katz
 So, to solve the conundrum: only PublicationDate_xml and
 PublicationDate are of interest. If the former is given, use it and
 print (if available) its .month, .day, and .year fields. Else, if the
 latter is given, just print it.
 Ignore all other date-related fields. Ignore PublicationDate_xml.text.
  Ignore if there's more than one date field - use the first one.

Like I said, I don't claim that this is the one and only correct way to handle 
the data...  but you've correctly described what we're doing in VuFind, and so 
far nobody has complained about it!

- Demian


Re: [CODE4LIB] dealing with Summon

2011-03-01 Thread Andrew Nagy
Hi Godmar - to help answer some of your questions about the fields - I can
help address those directly.  Though it would be interesting to hear
experiences from others who are working from APIs to search systems such as
Summon or others.

With regard to the publication date - the Summon API has the raw date
(which comes directly from the content provider), but we also provide a
field with a microformat containing the parsed and cleaned date that Summon
has generated.  We advise you to use our parsed and cleaned date rather
than the raw date.  The URL and URI fields are similar: the URL is the link
that we have generated, and the URI is what is provided by the content
provider.  In your case, you appear to be referring to OPAC records, so the
URI is the ToC link that came from the 856$u field in your MARC records.  The
URL is a link to the record in the OPAC.
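
As a rough illustration of that distinction, a client might separate
the two kinds of link like this. It is a sketch only: the function name
pick_links and the output keys are invented, and treating the
lower-case 'url' field as a fallback for 'URL' is an assumption, since
the relationship between those two fields is not spelled out here.

```python
def pick_links(record):
    """Split a record's links into the generated link and provider links."""

    def values(name):
        # Fields may be missing, a single string, or a list of strings.
        value = record.get(name) or []
        return value if isinstance(value, list) else [value]

    generated = values("URL") or values("url")  # fallback is an assumption
    return {
        "record_link": generated[0] if generated else None,
        "provider_links": values("URI"),
    }
```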

If you need more assistance around the fields that are available via Summon,
I'd be happy to take this conversation off-list.

I think an interesting conversation for the Code4Lib community would be
around a standardized approach for an API that meets both the needs of the
library developer and the product vendor.  I recall a brief chat I had with
Annette about this same topic at a NISO conference in Boston a while back.
For example, we have SRU/W, but that does not provide support for all of the
features that a search engine would need (e.g., facets, spelling corrections,
recommendations, etc.).  Maybe a new standard is needed - or maybe extending
an existing one would solve this need?  I'm all ears if you have any ideas.

Andrew


On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back god...@gmail.com wrote:

 Hi -

 this is a comment/question about a particular discovery system
 (Summon), but perhaps of more general interest. It's not intended as
 flamebait or criticism of the vendor or people associated with it.

 When integrating Summon into LibX (which works quite nicely btw,
 gratuitous screenshot is attached to this email) I found myself amazed
 by the multitude of possible fields and combinations returned in the
 resulting records. For instance, some records contain fields 'url'
 (lower case), and/or 'URL' (upper case), and/or 'URI' (upper case).
 Which one to display, and how?  For instance, some records contain an
 OPAC URL in the 'url' field, and a ToC link in the URI field. Why?

 Similarly, the date associated with a record can come in a variety of
 formats. Some are single-field (20080901), some are abbreviated
 (200811), some are separated into year, month, date, etc.  Some
 records have a mixture of those.

 My question is how do other adopters of Summon, or of emerging
 discovery systems that provide direct access to their records in
 general, deal with the roughness of the records being returned?  Are
 there best practices in how to extract information from them, and in
 how to prioritize relevant and weed out irrelevant or redundant
 information?

  - Godmar



Re: [CODE4LIB] dealing with Summon

2011-03-01 Thread Roy Tennant
 On Tue, Mar 1, 2011 at 2:14 PM, Godmar Back god...@gmail.com wrote:

Similarly, the date associated with a record can come in a variety of
formats. Some are single-field (20080901), some are abbreviated
(200811), some are separated into year, month, date, etc.  Some
records have a mixture of those.

In this world of MARC (s/MARC/hurt) I call that an embarrassment of
riches. I've spent some bit of time parsing MARC, especially lately,
and just the fact that Summon provides a normalized date element is
HUGE. That potentially takes that load off of your application, should
it be forced to absorb native MARC. This is the old Garbage In/Garbage
Out (GIGO) issue. Case in point: just today I discovered that we have
at least 1,600 856 fields in Worldcat that have the pipe symbol | in
the second indicator position instead of the numeral one. Right, that
means there are rocket scientists who thought the documentation was
indicating a pipe symbol in that position.
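
A check for that particular problem could look something like the
Python sketch below. It assumes each field has already been turned into
a plain dict with hypothetical keys "tag", "ind1" and "ind2"; it is not
the API of any MARC library, and it only flags the fields rather than
guessing what the indicator should have been.

```python
def find_pipe_indicator_856(fields):
    """Return the 856 fields whose second indicator is a literal '|'."""
    return [
        field for field in fields
        if field.get("tag") == "856" and field.get("ind2") == "|"
    ]
```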

We then have granularity issues, punctuation issues, and variance in
practice. And that's just for starters. Huge props to Summon for
trying to tackle some of these things, as we are attempting to do as
well.
Roy