Re: JSON for MARC

Alexander Wagner Thu, 20 Sep 2012 01:19:31 -0700

On 20.09.2012 09:09, Jerome Caffaro wrote:

Hello Jerome!

> I just wanted to take the opportunity to discuss a side-comment about
> "Marc-Based JSONs" you made on a different thread ("A propos de
> Z39.50". Starting a new thread as I am deviating from the original
> topic):


No problem. :)

>> We are currently using this in our websubmit autofill functions, if it's
>> of any help we could pretty quickly drop that to another user.
>>
>> However, due to its history this is one of our Perl backends and I
>> didn't implement it completely generic but for PICA systems. It works
>> well, heavily tested with the GVK Union Catalogue in Göttingen. Spitting
>> out nice Marc-Based JSONs. (Could even be hooked up as a CGI or via our
>> gateway module.)


> What does this MARC JSON look like? How "nice" is it?


Depends on your definition of "nice", I'd say. ;)

First of all, we use JSON returns mainly in websubmit-based
backend functions, most of the time containing only a stub
of the full Marc record. A common example is importing
external metadata, say from arXiv, CrossRef or PubMed, into
the system. (This would model the Z39.50 case in the
original mail.) This comes along with some "restrictions",
so to say. Besides: initially, I didn't do the broad
research on suggestions that you outline here:

> Internally we have been briefly musing about the various ways to encode
> MARC in JSON. It would be interesting to know which path you have
> chosen.
>
> For the record here are some possibilities explored:
>
> [...]

We just modeled it to our needs, with one goal: having a
small(!) format and not these chatty pseudo-XML things.
Otherwise we could just stick to XML anyway. This stems from
the larger lists returned as results for autosuggests, e.g.
(Imagine: you have some 32,000 journal titles in your db and
someone just keys in phy...)

This requirement also ruled out what you call "user friendly
alternatives", i.e. using tags that have no close
association with the Marc expression. Here I am a librarian:
Marc is perfect, no need to use text that does not map to a
tag without ambiguity. Therefore 1001_a is much preferred to
something like "first_author_full_name". It is unique, it is
short and I can easily look up its precise meaning at LoC. ;)

Second, we have some restrictions in websubmit insofar as we
need to pass on structured returns in HTML forms. Say we
have an input field like "journal name", being an
autosuggest. You key in "Phys Rev D", get a list and select
"Physical Review D". However, in the backend this writes a
bunch of additional data to our records, e.g. the DDC of
this journal, its unique ID, ISSN, publisher, place,
statistic keys like "listed in JCR, SCI, SCIEXP, <and what
not>". Due to websubmit we need to pass a structure within a
text. (This would probably be something that is "not nice",
but well.)

Then we need to pass on some non-marcish data. Think of the
autosuggest: the label presented to the user (surely not the
full Marc ;).

And finally we want JS code that is as easy, short and
simple as possible, as JS is always a PITA, even when using
things like jQuery. (Again, one reason why we use user
friendly Marc tags ;)

What we came up with was thus mainly inspired by Invenio's
internal representation of Marc, so we might not be entirely
general in this regard. In particular, we do not model some
real-life (but quite esoteric) librarians' cases. (We
definitely cannot model the correct Marc coding for these
beasts: http://gso.gbv.de/DB=2.1/PPNSET?PPN=623521288;
though it takes a librarian half a day to just catalogue
them properly as well ;)

Keeping all this in mind resulted in the following simple
rules for our JSON expressions:

- Use Hashes

- Key them by Marc field and prefix the key with an "I" (for
  input) to have valid tags in JS and HTML. So Marc 100 with
  indicators 1 and blank is noted as

  I1001_

  This matches something like ^I[0-9]{3}

- Make the values valid strings: if they themselves contain
  JSON-ish (sub-)structures, escape the quotes properly.

- As long as a field is unique, give the full Marc tag, i.e.
  field and subfield. E.g.

  "I1001_a" : "Wagner, Alexander"

- Use Hashes of Hashes if you want to pass on structured
  data. Here give only the field, no subfields, and use the
  subfields as keys of the nested hash:

  "I1001_" : {"a" : "Wagner, Alexander",
             "0": "P:(DE-Juel1)133832" }

  (Unescaped version shown.)

- Use Hashes of Arrays of Hashes if you have repeatable
  fields, e.g.

  "I0247_" : "[
     {\"2\" : \"doi\",  \"a\" : \"10.1016/j.bpj.2010.06.068\", },
     {\"2\" : \"pmid\", \"a\" : \"pmid:20923669\", },
     {\"2\" : \"pmc\",  \"a\" : \"pmc:PMC3042554\", },
     {\"2\" : \"ISSN\", \"a\" : \"0006-3495\", },
     {\"2\" : \"ISSN\", \"a\" : \"1542-0086\", },
     ]",

  Here you see the ugly part with the escaping of quotes.
  But: it doesn't get worse, so we can live with that...

  Note: this method does not allow for repeatable subfields
  of course, which, in principle, are allowed in Marc. But:
  our inputter is an end user, so we don't expect the fancy
  stuff librarians invent to show exotic things correctly.
  (Joe User wouldn't understand those nifty details anyway,
  and the library could edit the final Marc as Marc in
  bibedit if the need ever arises.)
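
As a rough illustration of the rules above, here is a minimal sketch in Python (not the Perl backend itself). The names KEY_RE, parse_key, encode_field and decode_field are made up for this example, and the sketch emits strict JSON, i.e. without the trailing commas seen in the hand-written samples.

```python
import json
import re

# "I" + 3-digit field + two indicator characters + optional subfield code.
# A blank indicator is written as "_", as in I245__a.
KEY_RE = re.compile(r"^I([0-9]{3})(.)(.)([0-9a-z])?$")


def parse_key(key):
    """Split a key such as 'I1001_a' into (field, ind1, ind2, subfield)."""
    match = KEY_RE.match(key)
    if match is None:
        return None  # reserved or non-Marc keys such as 'label'
    return match.groups()


def encode_field(value):
    """Serialise a nested structure into a plain string value, so the
    outer hash only ever maps strings to strings."""
    return json.dumps(value, ensure_ascii=False)


def decode_field(text):
    """Reverse of encode_field: parse the string back into a structure."""
    return json.loads(text)


record = {
    "I1001_a": "Wagner, Alexander",
    # Repeatable field: a hash of arrays of hashes, passed as a string.
    "I0247_": encode_field([
        {"2": "doi", "a": "10.1016/j.bpj.2010.06.068"},
        {"2": "pmid", "a": "pmid:20923669"},
    ]),
}

# Serialising the outer hash escapes the inner quotes automatically.
wire = json.dumps(record)
received = json.loads(wire)
identifiers = decode_field(received["I0247_"])

print(parse_key("I1001_a"))
print(identifiers[0]["a"])
```

The point of the double encoding is that every value stays a plain string, which is what an HTML form field can carry; the consumer decodes a second time only for the fields it knows to be structured.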

Sidenote: In websubmit, we use user-defined fields named in
user-friendly Marc notation, i.e. we do not have a field like
"title", but only a field "I245__a". Now, if I get the title
from a JSON return I do not need logic in JS that translates
things to and fro, and I can easily model imports that ask
several databases without always keeping track of field
names and inventing our own naming convention. It's just all
Marc, definition given at LoC. (In the last example we have
some values of DB A that get overwritten by DB B, as B
usually has better cataloguing, e.g.)
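
Because every source answers with the same Marc-style keys, combining several databases can be as simple as overlaying hashes. A minimal sketch in Python, assuming flat string-to-string hashes; merge_sources is a hypothetical helper, the shortened title from "DB A" is invented for illustration, and skipping empty values is a design assumption rather than something stated above.

```python
def merge_sources(*sources):
    """Overlay Marc-keyed hashes; later sources win over earlier ones."""
    merged = {}
    for source in sources:
        for key, value in source.items():
            if value:  # assumption: an empty value never replaces data
                merged[key] = value
    return merged


# DB A answers first, DB B (better cataloguing) is applied on top.
from_db_a = {
    "I245__a": "NMR separation of intra- and extracellular compounds",
    "I041__a": "eng",
    "I773__v": "99",
}
from_db_b = {
    "I245__a": "NMR separation of intra- and extracellular compounds "
               "based on intermolecular coherences.",
    "I773__v": "",
}

record = merge_sources(from_db_a, from_db_b)
print(record["I245__a"])
```

No key translation table is needed at any point, which is the benefit the sidenote describes.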

Given all this, the full record for the above example
(pmid:20923669) looks like this:

   {
    "I0247_"  : "[ {\"2\" : \"doi\", \"a\" :
\"10.1016/j.bpj.2010.06.068\", },  {\"2\" : \"pmid\", \"a\" :
\"pmid:20923669\", },  {\"2\" : \"pmc\", \"a\" : \"pmc:PMC3042554\", },
 {\"2\" : \"inh\", \"a\" : \"inh:12128365\", },  {\"2\" : \"ISSN\",
\"a\" : \"0006-3495\", },  {\"2\" : \"ISSN\", \"a\" : \"1542-0086\", }, ]",
    "I041__a" : "eng",
    "I082__a" : "570",
    "I1001_"  : "[ {\"a\" : \"Hoerr, Verena\", \"b\" : \"0\", },
{\"a\" : \"Purea, Armin\", \"b\" : \"1\", },  {\"a\" : \"Faber,
Cornelius\", \"b\" : \"2\", }, ]",
    "I1001_a" : "Hoerr, Verena ; Purea, Armin ; Faber, Cornelius ; ",
    "I245__a" : "NMR separation of intra- and extracellular compounds
based on intermolecular coherences.",
    "I260__a" : "[u. a.]",
    "I260__b" : "Biophysical Society",
    "I260__c" : "2010",
    "I520__a" : "NMR spectroscopy is a powerful tool for detection and
characterization of chemical compounds in biological systems. Its
application in pharmaceutical studies in cell cultures, however, has
been hampered by the enormous technical challenges in separating intra-
from extracellular amounts of one substance. We introduce a novel
approach to separate intra- from extracellular NMR signal based on the
detection of intermolecular zero-quantum coherences in presence of a
chemical shift agent. In a sample of large cells in culture, the
investigation of cellular uptake of pharmacological substances becomes
feasible. The addition of 10 mM Tm-DOTP to a suspension of 100 Xenopus
laevis oocytes resulted in sufficient separation of resonance
frequencies between intra- and extracellular water. Upon selective
excitation of either intra- or extracellular water signal, only intra-
or extracellular components were observed, respectively. The presented
localization technique provides intrinsic averaging over a large number
of cells, resulting in a significant signal gain. The method works on
standard NMR spectrometers, which are available at most scientific
research institutions today. On a high-resolution NMR system with a
cryoprobe, a 20-fold sensitivity gain was observed as compared to
conventionally localized NMR spectroscopy of a single X. laevis oocyte
on dedicated NMR microscopes.",
    "I588__a" : "Dataset connected to CrossRef,
zb0027.zb.kfa-juelich.de, PubMed, ",
    "I650_2"  : "[ {\"2\" : \"MeSH\", \"a\" : \"Animals\", },  {\"2\" :
\"MeSH\", \"a\" : \"Biological Factors: chemistry\", },  {\"2\" :
\"MeSH\", \"a\" : \"Biological Factors: isolation & purification\", },
{\"2\" : \"MeSH\", \"a\" : \"Choline\", },  {\"2\" : \"MeSH\", \"a\" :
\"Extracellular Space: chemistry\", },  {\"2\" : \"MeSH\", \"a\" :
\"Intracellular Space: chemistry\", },  {\"2\" : \"MeSH\", \"a\" :
\"Magnetic Resonance Spectroscopy: methods\", },  {\"2\" : \"MeSH\",
\"a\" : \"Oocytes: cytology\", },  {\"2\" : \"MeSH\", \"a\" : \"Oocytes:
metabolism\", },  {\"2\" : \"MeSH\", \"a\" : \"Solvents\", },  {\"2\" :
\"MeSH\", \"a\" : \"Xenopus laevis\", }, ]",
    "I650_7"  : "[ {\"2\" : \"NLM Chemicals\", \"a\" : \"Biological
Factors\", },  {\"2\" : \"NLM Chemicals\", \"a\" : \"Solvents\", },
{\"0\" : \"62-49-7\", \"2\" : \"NLM Chemicals\", \"a\" : \"Choline\", }, ]",
    "I773__0" : "PERI:(DE-600)1477214-0",
    "I773__a" : "10.1016/j.bpj.2010.06.068",
    "I773__g" : "Vol. 99, no. 7, p. 2336 - 2343",
    "I773__n" : "7",
    "I773__p" : "2336 - 2343",
    "I773__q" : "99:7<2336 - 2343",
    "I773__t" : "Biophysical journal",
    "I773__v" : "99",
    "I773__x" : "0006-3495",
    "I773__y" : "2010",
    "I8567_2" : "Pubmed Central",
    "I8567_u" : "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3042554";,
    "I915__"  : "[{ \"a\" : \"JCR/ISI refereed\", \"0\" :
\"StatID:(DE-HGF)0010\", \"2\" : \"StatID\", }, { \"a\" : \"JCR\", \"0\"
: \"StatID:(DE-HGF)0100\", \"2\" : \"StatID\", }, { \"a\" :
\"DBCoverage\", \"0\" : \"StatID:(DE-HGF)0200\", \"2\" : \"StatID\",
\"b\" : \"SCOPUS\", }, { \"a\" : \"DBCoverage\", \"0\" :
\"StatID:(DE-HGF)0300\", \"2\" : \"StatID\", \"b\" : \"Medline\", }, {
\"a\" : \"DBCoverage\", \"0\" : \" StatID:(DE-HGF)0310\", \"2\" :
\"StatID\", \"b\" : \"NCBI Molecular Biology Database\", }, { \"a\" :
\"WoS\", \"0\" : \"StatID:(DE-HGF)0110\", \"2\" : \"StatID\", \"b\" :
\"Science Citation Index\", }, { \"a\" : \"WoS\", \"0\" :
\"StatID:(DE-HGF)0111\", \"2\" : \"StatID\", \"b\" : \"Science Citation
Index Expanded\", }, { \"a\" : \"DBCoverage\", \"0\" :
\"StatID:(DE-HGF)1030\", \"2\" : \"StatID\", \"b\" : \"Current Contents
- Life Sciences\", }, { \"a\" : \"DBCoverage\", \"0\" :
\"StatID:(DE-HGF)1050\", \"2\" : \"StatID\", \"b\" : \"BIOSIS
Previews\", }, ]",
    "SHORTTITLE" : "NMR separation of intra- and extracellular
compounds based on intermolecular coherences. / Hoerr, Verena ;
Biophysical journal 99 2336 - 2343 ;  [u. a.] : Biophysical Society,
2010 ; 10.1016/j.bpj.2010.06.068 ; ",
    "DUPES" : "124524, 130115, "
   }

Note: this does not include the reserved tag "label", which
is what the user usually gets displayed, as this output is
really meant to just fill in the full mask without further
intervention. However, it contains SHORTTITLE and DUPES,
which give the result in a short ISBD-like notation and
possible dupes in the database by their record ID (simple
dupe matching based on the values in 0247_a), so we can
prevent the user from erroneously adding a second record
with the same content or the wrong record (e.g. after a typo
in the PMID).
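
How such a dupe check might look is sketched below in Python. This is an assumption-laden illustration, not the actual implementation: find_dupes and format_dupes are hypothetical names, the index (identifier to record IDs) stands in for whatever lookup the database offers, and which identifier maps to which record ID is invented here.

```python
import json


def find_dupes(record, index):
    """Return the sorted record IDs that share a 0247_ $a value."""
    identifiers = json.loads(record.get("I0247_", "[]"))
    hits = set()
    for entry in identifiers:
        hits.update(index.get(entry.get("a"), ()))
    return sorted(hits)


def format_dupes(record_ids):
    """Render the IDs in the comma-terminated style of the DUPES key."""
    return "".join(f"{record_id}, " for record_id in record_ids)


# Hypothetical lookup table: identifier value -> existing record IDs.
index = {
    "pmid:20923669": [124524],
    "10.1016/j.bpj.2010.06.068": [124524, 130115],
}
record = {
    "I0247_": json.dumps([
        {"2": "doi", "a": "10.1016/j.bpj.2010.06.068"},
        {"2": "pmid", "a": "pmid:20923669"},
    ]),
}

dupes = find_dupes(record, index)
print(format_dupes(dupes))
```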

The above is actually the output of

  $ ./GenMetadata.pl mode=full pmid=20923669 format=JSON

GenMetadata being the father of all imports, itself calling
various backends to collect data.

Now, we have several bibformat_templates for various sets of
our data and one format JS that calls them appropriately, so
you can usually specify &of=js for whatever we have.

E.g. for an authority record linkage, if you search for some
name you'd get something like

   {
   "label"     : "Wagner, G. A. (VDB1731)",
   "I1001_0"   : "P:(DE-Juel1)VDB1731",
   "I1001_a"   : "Wagner, G. A.",
   },
   {
   "label"     : "Wagner, A. (VDB13547)",
   "I1001_0"   : "P:(DE-Juel1)VDB13547",
   "I1001_a"   : "Wagner, A.",
   },
   {
   "label"     : "Wagner, Alexander (ZB / [email protected])",
   "I1001_0"   : "P:(DE-Juel1)133832",
   "I1001_a"   : "Wagner, Alexander",
   "I371__0"   : "I:(DE-Juel1)ZB-20090406",
   "I371__m"   : "[email protected]",
   "I371__c"   : "ZB",
   "I373__0"   : "I:(DE-Juel1)ZB-20090406",
   },

"label" allows the user to select the correct one, the rest
of this hash gets stored to the record.
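
The split between what is shown and what is stored could be sketched like this in Python. split_suggestion is a hypothetical helper; the sample entries are taken from the output above and are assumed to have been parsed into dictionaries already.

```python
def split_suggestion(entry):
    """Separate the display label from the Marc fields to be stored."""
    fields = {key: value for key, value in entry.items() if key != "label"}
    return entry.get("label", ""), fields


suggestions = [
    {
        "label": "Wagner, G. A. (VDB1731)",
        "I1001_0": "P:(DE-Juel1)VDB1731",
        "I1001_a": "Wagner, G. A.",
    },
    {
        "label": "Wagner, A. (VDB13547)",
        "I1001_0": "P:(DE-Juel1)VDB13547",
        "I1001_a": "Wagner, A.",
    },
]

# The labels populate the pick list shown to the user ...
labels = [split_suggestion(entry)[0] for entry in suggestions]

# ... and once an entry is chosen, only its Marc fields go to the record.
label, fields = split_suggestion(suggestions[0])
print(label)
print(fields)
```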

I think this gives some impression of what we do in the
backend here and how we use JSON. Modulo, of course, some
technicalities which are not quite clear from the above, but
stem mainly from the use in Websubmit and Ajax itself.

Surely we didn't find the very best way, but one that was
pretty simple to implement and matches the needs. (E.g. most
of the JSONs are actually plain format templates of Invenio,
that's why I have a superfluous comma in the above output: I
found no way to drop it on the last record in a list of
results without the need to resort to Python, which in the
above case is not that convenient for some reason.)
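
A strict JSON parser rejects such trailing commas, so a consumer outside the browser has to tolerate them. One possible workaround, sketched in Python: loads_lenient and loads_result_list are hypothetical helpers, and the regular expression is deliberately naive. It is not string-aware, so a value that itself contains a comma directly before a closing brace or bracket would be altered.

```python
import json
import re

# A comma followed only by whitespace and a closing brace or bracket.
TRAILING_COMMA = re.compile(r",(\s*[}\]])")


def loads_lenient(text):
    """Parse JSON-like text after dropping trailing commas."""
    return json.loads(TRAILING_COMMA.sub(r"\1", text))


def loads_result_list(text):
    """Parse a comma-separated run of objects as one list."""
    return loads_lenient("[" + text + "]")


authority_output = """
{
"label"     : "Wagner, G. A. (VDB1731)",
"I1001_0"   : "P:(DE-Juel1)VDB1731",
"I1001_a"   : "Wagner, G. A.",
},
{
"label"     : "Wagner, A. (VDB13547)",
"I1001_0"   : "P:(DE-Juel1)VDB13547",
"I1001_a"   : "Wagner, A.",
},
"""
results = loads_result_list(authority_output)

# Escaped sub-structures carry the same commas inside the string value.
nested = r'{"I0247_" : "[ {\"2\" : \"doi\", \"a\" : \"10.1016/j.bpj.2010.06.068\", }, ]", }'
record = loads_lenient(nested)
identifiers = loads_lenient(record["I0247_"])

print([entry["label"] for entry in results])
print(identifiers)
```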

HTH :)

--

Kind regards,

Alexander Wagner
Subject Specialist
Central Library
52425 Juelich

mail : [email protected]
phone: +49 2461 61-1586
Fax  : +49 2461 61-6103
www.fz-juelich.de/zb/DE/zb-fi


