[CODE4LIB] databases/indexes with well-structured output

Eric Lease Morgan Wed, 06 Nov 2013 07:08:51 -0800

What are some of the more popular and useful bibliographic databases/indexes 
with well-structured output?


If it were easy (trivial) for our readers to get sets of well-structured data 
out of our bibliographic databases, then it would be relatively easy for us to 
write software enabling readers to use and understand — evaluate — their data. 
What databases/indexes lend themselves to this solution? Let me elaborate.

JSTOR’s Data For Research service provides complete access to the totality of 
JSTOR, sans the articles themselves, unless you are authorized. [1] A person 
can search JSTOR and then request a data dump complete with citations, keyword 
frequencies, and n-grams. This data can then be used to create a report — like 
a timeline or tag clouds or concordances — illustrating the characteristics of 
the found set. About six months ago I wrote a program, the beginnings of such a 
report. [2]

Suppose a reader diligently used something like Endnote, Zotero, or RefWorks to 
save and manage their bibliographic citations of interest. If the reader were 
to export some or all of their bibliographic data to a file, then the result 
would be well-structured and computer readable. Things like titles, authors, 
keywords/subjects, maybe abstracts, and citations would be neatly delimited. If 
this file were read by a second computer program, new views of the data could be 
manifested. Again, a timeline could be created. Wordclouds could be created. An 
analysis could be done against the data to determine frequent authors. 
Relationships between authors might also be exposed. All of this would 
assist the reader in evaluating their found set. 
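
To make the idea concrete, here is a minimal sketch in Python. It assumes the reader exported their citations in the RIS format (an option in Endnote, Zotero, and RefWorks), and it only tallies authors and publication years. The function names and the sample records are made up for illustration; a real tool would need to cope with messier exports.

```python
from collections import Counter


def parse_ris(text):
    """Split RIS text into records, each a dict of tag -> list of values."""
    records, current = [], {}
    for line in text.splitlines():
        # An RIS line looks like "AU  - Smith, Jane": tag, two spaces, hyphen.
        if len(line) < 5 or line[2:5] != "  -":
            continue
        tag, value = line[:2], line[5:].strip()
        if tag == "ER":  # "ER" marks the end of a record
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


def summarize(records):
    """Count how often each author and each publication year occurs."""
    authors = Counter(a for r in records for a in r.get("AU", []))
    # Some exports write the year as "2013///", so keep the first four characters.
    years = Counter(r["PY"][0][:4] for r in records if r.get("PY"))
    return authors, years


# Hypothetical sample data, not taken from any real export.
SAMPLE = """TY  - JOUR
AU  - Smith, Jane
AU  - Doe, John
TI  - First example title
PY  - 2011
ER  - 
TY  - BOOK
AU  - Smith, Jane
TI  - Second example title
PY  - 2013
ER  - 
"""

if __name__ == "__main__":
    authors, years = summarize(parse_ris(SAMPLE))
    print(authors.most_common(5))  # frequent authors
    print(sorted(years.items()))   # raw material for a timeline
```

The year counts are the beginnings of a timeline, and the same loop could feed titles and keywords to a word cloud.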

Through the use of APIs I can search things like WorldCat, the HathiTrust, or 
the Internet Archive. The result could be (for better or for worse) MARC 
records. Again, analysis could be done against this data not to find 
information (that has already been done), but rather to evaluate the data — 
look for patterns and anomalies.
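
As a rough sketch of what such an evaluation might look like, the Python below assumes the search results have already been saved as MARCXML (the XML serialization of MARC) and simply tallies the subject headings found in field 650, subfield a. The sample records are invented, and the helper names are mine; how the records are fetched from any particular API is left out on purpose.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The MARCXML namespace published by the Library of Congress.
MARC_NS = "http://www.loc.gov/MARC21/slim"
NS = {"marc": MARC_NS}


def subfields(record, tag, code):
    """Yield the text of every subfield `code` in every datafield `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    for node in record.findall(path, NS):
        if node.text:
            yield node.text.strip()


def tally_subjects(xml_text):
    """Count topical subject headings (650 $a) across all records."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for record in root.iter(f"{{{MARC_NS}}}record"):
        for heading in subfields(record, "650", "a"):
            # Trailing periods vary from record to record, so trim them.
            counts[heading.rstrip(" .")] += 1
    return counts


# Hypothetical sample data, not taken from any real catalog.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">First example title</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Libraries.</subfield>
    </datafield>
  </record>
  <record>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Libraries</subfield>
    </datafield>
    <datafield tag="650" ind1=" " ind2="0">
      <subfield code="a">Text mining.</subfield>
    </datafield>
  </record>
</collection>"""

if __name__ == "__main__":
    for heading, count in tally_subjects(SAMPLE).most_common():
        print(count, heading)
```

A heading that dominates the tally is a pattern; a heading that appears only once in a large set may be an anomaly worth a second look.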

Put another way, instead of trying to force people to do the best and most 
perfect bibliographic search, allow them to do broad searches and then provide 
supplementary tools enabling the reader to examine the results. It is not about 
find. It is about use & understand.

I prefer XML to other data structures, but I will not necessarily limit myself 
to XML. What information sources would you suggest I use? Here is a short, 
unordered list:

  * JSTOR Data For Research data
  * Zotero (RDF) XML output
  * WorldCat, HathiTrust, Internet Archive

After I write the “search results evaluation tool”, I will then go to the next 
step and provide tools for the “distant reading” of individual items à la my 
PDF2TXT application. [3]

We here in libraries can no longer just give people access to information 
because people have more access than they know what to do with. Instead, I 
think an opportunity exists for us to provide tools for evaluating the 
information they have so they can use & understand it. Call it “scalable, 
computer-supplemented information literacy”.


[1] Data For Research - http://dfr.jstor.org
[2] JSTOR Tool - http://dh.crc.nd.edu/sandbox/jstor-tool/
[3] PDF2TXT - http://dh.crc.nd.edu/sandbox/pdf2txt.cgi

--
Eric Morgan
University of Notre Dame
