[CODE4LIB] REST Fest 2010

2010-08-05 Thread Benjamin Young
  I enjoyed being a part of the code4lib 2010 conference in Asheville 
this past February, and wanted to return the favor by inviting you all 
to come to an event I'm organizing in Greenville, SC.


REST Fest 2010 and Hypermedia Workshop

Friday, September 17, 2010 at 8:00 AM - Saturday, September 18, 2010 at 
6:00 PM (ET) Greenville, SC


Co-Chairs: Mike Amundsen & Benjamin Young
REST Fest 2010 (Sep 17th & 18th)

REST Fest is a community unconference event focused on the REST 
architectural style and implementations. This year, REST Fest will 
encourage developers who have direct experience building RESTful 
applications for the World Wide Web to share their successes and their 
frustrations in an informal atmosphere. REST Fest will also keep a 
"Hack Room" open throughout the two-day event where attendees can get 
together and work on any project they like.

http://restfest.org

Call for Presenters

In the spirit of the "Unconference" model, all talks are automatically 
accepted as "Lightning Talks" (five slides in five minutes). Presenters 
are encouraged to submit a title, a short abstract (250 words or less), 
and an indication of the "level" of the talk (beginner, intermediate, 
advanced). "How To..." talks are encouraged, as are "How Do I?" talks. 
A small number of talks will be chosen as "Selected Talks" with a 
format of 30+ minutes. Breakout sessions will be added as desired by 
the attendees.

http://restfest2010.eventbrite.com/

Workshop: Hypermedia Hacking with Mike Amundsen (Sep 17th)

In this one-day pre-event workshop, attendees will learn how to 
implement an alternative to one-off Web APIs using Hypermedia Engines. 
The all-day session includes a mix of presentation, discussion, and 
hands-on implementation. Attendees are encouraged to bring laptops and 
"code-along" with supplied examples throughout the day.

http://www.restfest.org/schedule/workshop

Thanks for reading, and I hope to see you there,
Benjamin
--

President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-13 Thread Benjamin Young

On 5/13/10 8:59 AM, Fernando Gómez wrote:

Any suggestions? Do other document oriented databases offer a better
solution for this?
   

Hey Fernando,

I'd suggest you check out CouchDB. CouchDB uses JSON as its document 
format and provides advanced indexing (on anything in the JSON docs) via 
map/reduce queries, typically written in JavaScript. These queries are 
simple JavaScript lambda functions stored in a "design" document (also a 
simple JSON object) in CouchDB. Check out the following two links for 
more info:

http://books.couchdb.org/relax/design-documents/design-documents
http://books.couchdb.org/relax/design-documents/views

A simple map/reduce query using your city and address.city keys would 
look something like this:


function (doc) {
  // emit each document keyed by its city, checking both places
  // the value might live
  if (doc.city) {
    emit(doc.city, doc);
  } else if (doc.address && doc.address.city) {
    emit(doc.address.city, doc);
  }
}

That function would return the full document representations "keyed" by 
their cities (which is handy for sorting, and for later reducing to 
count unique cities).
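
If you wanted that count, a matching reduce function might look like 
this sketch (pair it with the map function above and query the view 
with group=true to get one row per city):

function (keys, values, rereduce) {
  if (rereduce) {
    // values are partial counts from earlier reduce passes
    var total = 0;
    for (var i = 0; i < values.length; i++) {
      total += values[i];
    }
    return total;
  }
  // one entry per emitted (city, doc) pair at this stage
  return values.length;
}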


CouchDB lets you focus on pulling out the data you want, and it handles 
the indexing. Pretty handy. :)


Let me know if you have other questions about CouchDB.

Take care,
Benjamin


Re: [CODE4LIB] it's cool to hate on OpenURL

2010-04-29 Thread Benjamin Young

On 4/29/10 3:48 PM, Boheemen, Peter van wrote:

But all the flaws of XML can be traced back to SGML which is
why we now use JSON despite all of its limitations.
 

excuse me, but JSON is something completely different. It is an object notation 
and is not at all usable to structure data.
Don't let the CouchDB guys know that: http://couchdb.apache.org/ or any 
of the other JSON-based API builders either.

XML is great to describe complex data, but it is often badly used, like in MARC XML (ISO 2709 
described in XML). And it is misunderstood by a lot of programmers who only think in strings, 
integers, and the like.
In his vision of Xanadu every piece of published information had a
unique ID that was reused everytimes the publication was referenced -
which would solve our problem.
 

Keep on dreaming Jakob :-)
   


Re: [CODE4LIB] it's cool to hate on OpenURL

2010-04-29 Thread Benjamin Young

On 4/29/10 12:32 PM, MJ Suhonos wrote:

What I hope for is that OpenURL 1.0 eventually takes a place alongside SGML as 
a too-complex standard that directly paves the way for a universally adopted 
foundational technology like XML. What I fear is that it takes a place 
alongside MARC as an anachronistic standard that paralyzes an entire industry.
 

Hear hear.

I'm actually encouraged by Benjamin's linking (har har) to the httpRange-14 issue as being relevant 
to the concept of "link resolution", or at least redirection (indirection?) using URL 
surrogates for resources.  Many are critical of the TAG's "resolution" (har har har) of 
the issue, and think it places too much on the 303 redirect.

I'm afraid I still don't understand the issue fully enough to comment — though I'd love 
to hear from any who can.  I agree with Eric's hope that the library world can look to 
W3C's thinking to inform a "better way" forward for link resolving, though.
   
One key thing to remember with the W3C work is that URLs have to be 
dereferenceable. I can't look up (without an OpenURL resolver or Google 
or the like) a urn:isbn:{whatever}, but I can dereference 
http://dbpedia.org/resource/The_Lord_of_the_Rings -- which 303's to 
http://dbpedia.org/page/The_Lord_of_the_Rings -- which is full of more 
dereferenceable /resource/ URLs (via 303 See Other).
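
Roughly, that first hop looks like this on the wire (headers trimmed 
to the essentials):

GET /resource/The_Lord_of_the_Rings HTTP/1.1
Host: dbpedia.org

HTTP/1.1 303 See Other
Location: http://dbpedia.org/page/The_Lord_of_the_Rings

The 303 tells the client "the thing you asked about isn't itself a web 
document, but here's a document about it."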


The main thing that the W3C was trying to avoid was RDF that 
inadvertently talks about online documents when what it really wants to 
talk about is the "real thing." Real things (like books) need a URI, but 
ideally a URI that can be dereferenced (via HTTP in this case) to give 
the requester some information about that real thing--which isn't 
possible with the urn:isbn style schemes.


That's my primitive understanding of it anyway. Apologies if any 
overlaps with library tech are off. :)


Re: [CODE4LIB] Twitter annotations and library software

2010-04-29 Thread Benjamin Young
I vote (heh) for "d", which will look a lot like "c" anyway, but with 
smatterings of owl:sameAs and httpRange-14-style 303's to keep things 
interesting. :)


--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


On 4/29/10 2:01 PM, Jakob Voss wrote:

How about a bet instead of voting. In three years will there be:

a) No relevant Twitter annotations anyway
b) Twitter annotations but not used much for bibliographic data
c) A rich variety of incompatible bibliographic annotation standards
d) Semantic Web will have solved every problem anyway 


Re: [CODE4LIB] Twitter annotations and library software

2010-04-29 Thread Benjamin Young
At #ldow2010 on Tuesday there was a presentation on "semantic" Twitter 
via TwitLogic:

http://twitlogic.fortytwo.net/

You can download the full paper if you're really curious:
http://events.linkeddata.org/ldow2010/papers/ldow2010_paper16.pdf

The Twitter Annotations system was mentioned at the end as a possible 
side option. There's bound to be a good bit of talk in the Linked Data 
community about strapping RDF/RDFa onto Twitter Annotations, but I 
believe that's still just beginning.


Additionally, to someone outside of the library community proper, 
OpenURL's dependence on resolvers would be the largest concern. Anyone 
could build similar "real thing" URLs and use 303 See Other redirects 
to return one or more digital resources about that "real thing." See 
this for more information:

http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039

Enjoy the reads,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


On 4/29/10 10:32 AM, Rosalyn Metz wrote:

I'm going to throw in my two cents.

I dont think (and correct me if i'm wrong) we have mentioned once what
a user might actually put in a twitter annotation.  a book title?  an
article title? a link?

i think creating some super complicated thing for a twitter annotation
dooms it to failure.  after all, its twitter...make it short and
sweet.

also the 1.0 document for OpenURL isn't really that bad (yes I have
read it).  a good portion of it is a chart with the different metadata
elements.  also open url could conceivably refer to an animal and then
link to a bunch of resources on that animal, but no one has done that.
  i don't think that's a problem with OpenURL, i think that's a problem
with the metadata sent by vendors to link resolvers and librarians'
lack of creativity (yes i did make a ridiculous generalization that
was not intended to offend anyone but inevitably it will).  having
been a vendor who has worked with openurl, i know that the information
databases send seriously affects what you can actually do
in a link resolver.





On Thu, Apr 29, 2010 at 10:23 AM, Tim Spalding  wrote:
   

Can we just hold a vote or something?

I'm happy to do whatever the community here wants and will actually
use. I want to do something that will be usable by others. I also
favor something dead simple, so it will be implemented. If we don't
reach some sort of conclusion, this is an interesting waste of time. I
propose only people engaged in doing something along these lines get
to vote?

Tim

 


Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?

2010-04-12 Thread Benjamin Young

On 4/12/10 5:04 PM, Andrew Hankinson wrote:

Couldn't you do MARC ->  MARCXML ->  JSON?

-Andrew
   
Certainly, but the hard part is knowing what you want MARC to look like 
once it's in JSON. XML-to-JSON conversions generally need some "love" to 
make the data meaningful on the JSON side (attributes and such make a 
1-to-1 conversion complicated--though there have been attempts at 
general conversion scripts).
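
For example, a MARC-XML datafield carries its indicators as attributes:

<datafield tag="245" ind1="1" ind2="0">
  <subfield code="a">The Lord of the Rings</subfield>
</datafield>

One purely illustrative JSON mapping (not any agreed-upon format) might 
be:

{"tag": "245", "ind1": "1", "ind2": "0",
 "subfields": [{"code": "a", "value": "The Lord of the Rings"}]}

Whether the indicators become top-level keys, an array, or something 
else is exactly the kind of decision a MARC-in-JSON format has to 
settle.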


Once a JSON output format for MARC is done, converting from MARCXML 
to marc.json (or whatever) would be an "easy" first step.

On 2010-04-12, at 5:00 PM, Benjamin Young wrote:

   

On 4/12/10 4:47 PM, Ryan Eby wrote:
 

You could put your logs, marc records broken out by fields or
arrays/hashes (types in couchdb) in any of them but the approach each
takes would limit you (or empower you) differently.

   

Once there's a good marc2json script (and format) out there, it'd be grand to see marc 
records dumped into CouchDB to allow them to be replicated between groups of librarians 
(and even up to OpenLibrary). I'm still up for helping make that possible if anyone's 
"into" that. :)
 


Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?

2010-04-12 Thread Benjamin Young

On 4/12/10 4:47 PM, Ryan Eby wrote:

You could put your logs, marc records broken out by fields or
arrays/hashes (types in couchdb) in any of them but the approach each
takes would limit you (or empower you) differently.
   
Once there's a good marc2json script (and format) out there, it'd be 
grand to see marc records dumped into CouchDB to allow them to be 
replicated between groups of librarians (and even up to OpenLibrary). 
I'm still up for helping make that possible if anyone's "into" that. :)


Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?

2010-04-12 Thread Benjamin Young
From my understanding of key/value stores, one can put documents on the 
other side of the key, but any and all parsing/processing of that value 
happens outside of the database. In CouchDB, the entire document is 
queryable from within map/reduce views. After being queried on, those 
keys are indexed for faster future queries. So, in that way, CouchDB 
jumps over the key/value limitations and becomes a document database.


In addition to map/reduce output, there are also handy 
validate_doc_update functions that can be used to validate a JSON 
document prior to its insertion in the database--again, something not 
possible with key/value storage.
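
A minimal sketch of such a function (it lives in a design document; 
the required "title" field here is just an invented example):

function (newDoc, oldDoc, userCtx) {
  // reject any write that lacks a title field
  if (!newDoc.title) {
    throw({forbidden: 'Every document needs a title.'});
  }
}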


You can, though, use CouchDB in a key/value fashion by storing binary 
data (or HTML, XML, RDF, etc) as attachments or JSON encoded strings 
(where possible). In that case, you would just be retrieving them by id 
(or URL), but you could store all kinds of ad hoc metadata about those 
attachments and use those to query with later.


Also, the blog article Ryan Eby just posted is a great (and quick) 
overview of the varied NoSQL ecosystem. In many ways, these systems are 
as different as they are similar.


Hope your (re)search goes well,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


On 4/12/10 2:42 PM, Jonathan Rochkind wrote:

Yeah, I may have gotten it completely wrong.

Okay, help this grasshopper (possibly by pointing me to relevant 
documentation), what's the difference between "document-based" and 
"key-value store"?  When I've looked at CouchDB before, despite it 
describing itself as "document based", I haven't been able to tell 
what the difference is between it and a "key value store".  It seemed 
to support storing a "document" by key, and retrieving it by key.  It 
didn't seem to _do_ anything special with the document other than 
storing it there (maybe it DOES, but I missed it?).  So you can call 
it a "document" instead of a "value", but I couldn't figure out how 
that differed from a key-value store.


I guess it's that CouchDB _does_ let you build indexes on values other 
than the key?  Wacky, wonder how I missed that when I reviewed it last.


Jonathan

Ross Singer wrote:
On Mon, Apr 12, 2010 at 12:22 PM, Jonathan Rochkind 
 wrote:

The thing is, the NoSQL stuff is pretty much just a key-value store.
 There's generally no way to "query" the store, instead you can simply
look up a document by ID.


Actually, this depends largely on the NoSQL DBMS in question.  Some
are key value stores (Redis, Tokyo Cabinet, Cassandra), some are
document-based (CouchDB, MongoDB), some are graph-based (Neo4J), so I
think blanket statements like this are somewhat misleading.

CouchDB and MongoDB (for example) have the capacity to index the
values within the document - you don't just have to look up things by
document ID.

-Ross.



Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?

2010-04-12 Thread Benjamin Young
SQL-style JOINs can be done in CouchDB (I can't speak for the other 
NoSQL DBs).


In CouchDB, it's called view collation:
http://chrischandler.name/couchdb/view-collation-for-join-like-behavior-in-couchdb/

It's a different way of thinking (there are no tables, and map/reduce 
goes through every document to generate its output), but it is possible 
to get interestingly combined data out of the whole database.
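
As a sketch of the pattern (the "author"/"book" document types and 
fields here are hypothetical), a single map function emits complex keys 
that make related documents sort next to each other:

function (doc) {
  if (doc.type === 'author') {
    emit([doc._id, 0], doc);        // the author row sorts first...
  } else if (doc.type === 'book') {
    emit([doc.author_id, 1], doc);  // ...its books collate right after
  }
}

Reading the view in key order then gives you each author immediately 
followed by that author's books--the join-like behavior described in 
the article above.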


Later,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


On 4/12/10 11:08 AM, Robert Sanderson wrote:

Depends on the sort of features required, in particular the access
patterns, and the hardware it's going to run on.

In my experience, NoSQL systems (for example apache's Cassandra) have
extremely good distribution properties over multiple machines, much
better than SQL databases.  Essentially, it's easier to store a bunch
of key/values in a distributed fashion, as you don't need to do joins
across tables (there aren't any) and eventually consistent systems
(such as Cassandra) don't even need to always be internally consistent
between nodes.

If many concurrent write accesses are required, then NoSQL can also be
a good choice, for the same reasons as it's easily distributed.
And for the same reasons, it can be much faster than SQL systems with
the same data given a data model that fits the access patterns.

The flip side is that if later you want to do something that just
requires the equivalent of table joins, it has to be done at the
application level.  This is going to be MUCH MUCH slower and harder
than if there was SQL underneath.


Rob


On Mon, Apr 12, 2010 at 7:55 AM, Thomas Dowling  wrote:
   

So let's say (hypothetically, of course) that a colleague tells you he's
considering a NoSQL database like MongoDB or CouchDB, to store a couple
tens of millions of "documents", where a document is pretty much an
article citation, abstract, and the location of full text (not the full
text itself).  Would your reaction be:

"That's a sensible, forward-looking approach.  Lots of sites are putting
lots of data into these databases and they'll only get better."

"This guy's on the bleeding edge.  Personally, I'd hold off, but it could
work."

"Schedule that 2012 re-migration to Oracle or Postgres now."

"Bwahahahah!!!"

Or something else?



(  is a good jumping-in point.)


--
Thomas Dowling
tdowl...@ohiolink.edu

 


Re: [CODE4LIB] NoSQL - is this a real thing or a flash in the pan?

2010-04-12 Thread Benjamin Young
I'd actually vote for the "sensible, forward-looking approach." The BBC 
(for one) is already using CouchDB in production: 
http://damienkatz.net/2010/03/bbc_and_couchdb.html


That said, NoSQL as a "movement" is as wide and varied as the RDBMS 
world, and there are pros and cons to each option. I'm personally a 
proponent of CouchDB because of its RESTful API, JSON storage system, 
and JavaScript (or Erlang, PHP, Python, Ruby, etc.) map/reduce view 
engine. If your project needs replication at all (whether for scaling, 
data sharing, etc.), I'd take a good hard look at CouchDB, as that's 
its core distinction among the other NoSQL databases.


Hope that helps,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


On 4/12/10 10:55 AM, Thomas Dowling wrote:

So let's say (hypothetically, of course) that a colleague tells you he's
considering a NoSQL database like MongoDB or CouchDB, to store a couple
tens of millions of "documents", where a document is pretty much an
article citation, abstract, and the location of full text (not the full
text itself).  Would your reaction be:

"That's a sensible, forward-looking approach.  Lots of sites are putting
lots of data into these databases and they'll only get better."

"This guy's on the bleeding edge.  Personally, I'd hold off, but it could
work."

"Schedule that 2012 re-migration to Oracle or Postgres now."

"Bwahahahah!!!"

Or something else?



(  is a good jumping-in point.)


   


Re: [CODE4LIB] newbie

2010-03-25 Thread Benjamin Young

He means JavaScript. ;)

Honestly, though, PHP and all its faults notwithstanding, I highly 
recommend starting with a C-syntax-based language such as JavaScript, 
PHP, Java, or even C# (and obviously C and C++). Get some basic 
programming concepts understood, and then pursue the language that fits 
the bill for the task you're trying to solve.


Most languages share some similarities, so moving between them gets 
easier as you go along. Starting with a C-syntax-based language will 
put you in good stead for learning several more (the list above is by no 
means exhaustive).


If you want to check out some language usage statistics, I recommend 
these two sites:

http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
http://langpop.com/

And do join the #code4lib IRC channel. It's enjoyable regardless of the 
language you pick. :)


On 3/25/10 11:36 AM, Gabriel Farrell wrote:

You should /join #code4lib. Only there will you learn the secret one
true path to wisdom.

On Thu, Mar 25, 2010 at 11:31 AM, Matthew Bachtell
  wrote:
   

As someone who uses PHP to do the small things I would recommend using
Python or another language.  I am trying to transition away from PHP to
Python as it is not a panacea.  PHP's great for web scripting but was never
intended to do all of the duct taped projects that I have put together with
it.



On Thu, Mar 25, 2010 at 10:56 AM, Yitzchak Schaffer<
yitzchak.schaf...@gmx.com>  wrote:

 

On 3/24/2010 17:43, Joe Hourcle wrote:

   

I know there's a lot of stuff written in it, but *please* don't
recommend PHP to beginners.

Yes, you can get a lot of stuff done with it, but I've had way too many
incidents where newbie coders didn't check their inputs, and we've had
to clean up after them.

 

Another way of looking at this: part of learning a language is learning its
vulnerabilities and how to deal with them.  And how to avoid security holes
in web code in general.

--
Yitzchak Schaffer
Systems Manager
Touro College Libraries
33 West 23rd Street
New York, NY 10010
Tel (212) 463-0400 x5230
Fax (212) 627-3197
Email yitzchak.schaf...@tourolib.org

Access Problems? Contact systems.libr...@touro.edu

   
 


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-08 Thread Benjamin Young

On 3/6/10 6:59 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Saturday, March 06, 2010 05:11 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter

Anyway, hopefully, it won't be a huge surprise that I don't disagree
with any of the quote above in general; I would assert, though, that
application/json and application/marc+json should both return JSON
(in the same way that text/xml, application/xml, and
application/marc+xml can all be expected to return XML).
Newline-delimited json is starting to crop up in a few places
(e.g. couchdb) and should probably have its own mime type
and associated extension. So I would say something like:

application/json -- return json (obviously)
application/marc+json  -- return json
application/marc+ndj  -- return newline-delimited json
 

This sounds like consensus on how to deal with newline-delimited JSON in a 
standards based manner.

I'm not familiar with CouchDB, but I am using MongoDB which is similar.  I'll 
have to dig into how they deal with this newline-delimited JSON.  Can you 
provide any references to get me started?
   
Rather than using a newline-delimited format (the whole of which would 
not together be considered a valid JSON object), why not use the JSON 
array format, with or without newlines? Something like:


[{"key":"value1"}, {"key":"value2"}]

You could include newline delimiters after the "," if you needed to 
make pre-parsing easier (in a streaming context), but you may be able to 
get away with just looking for the next "," or "]" after each valid 
JSON object.


That would allow the entire stream, if desired, to be saved to disk and 
read back in as a single JSON object, or allow the same API to serve 
smaller JSON collections in a standard JSON way.


CouchDB uses this array notation when returning multiple document 
revisions in one request. CouchDB also offers a slightly more annotated 
structure (which might be useful with streaming as well):


{
  "total_rows": 2,
  "offset": 0,
  "rows": [{"key":"value1"}, {"key":"value2"}]
}

The "rows" array here plays the same role as the above array-based 
format, but provides an initial row count for the consumer to use (if it 
wants) to know what's ahead. The "offset" key is specific to CouchDB, 
but similar application-specific information could be stored in the 
"header" of the JSON object using this method.

In all cases, we should agree on a standard record serialization,
though, and the pure-json returns should include something that
indicates what the heck it is (hopefully a URI that can act as a
distinct "namespace"-type identifier, including a version in it).
 

I agree that our MARC-JSON serialization needs some "namespace" identifier in 
it and it occurred to me that the way it is handling indicators, e.g., ind1 and ind2 
properties, might be better handled as an array to accommodate IFLA's MARC-XML-ish where 
they can have from 1-9 indicator values.

BTW, our MARC-JSON content is specified in Unicode not MARC-8, per the JSON 
standard, which means you need to use \u notation to specify characters in 
strings, not sure I made that clear in earlier posts.  A downside to the 
current ECMA 262 specification is that it doesn't support \U00XX, as Python 
does, for the extended characters.  Hopefully that will get rectified in a 
future ECMA 262 specification.

   

The question for me, I think, is whether within this community,  anyone
who provides one of these types (application/marc+json and
application/marc+ndj) should automatically be expected to provide both.
I don't have an answer for that.
 
As far as mime-type declarations go in general, I'd recommend avoiding 
any format-specific mime types, sticking to the application/json 
format, and providing document-level hints (if needed) for the content 
type. If you do find a need for the special-case mime types, I'd 
recommend still responding to Accept: application/json whenever 
possible--for the sake of standards. :)


All told, I'm just glad to see this discussion being had. I'll be happy 
to provide some CouchDB test cases (replication, etc) if that's of 
interest to anyone.


Thanks,
Benjamin

I think this issue gets into familiar territory when dealing with RDF formats.  
Let's see, there is N3, NT, XML, Turtle, etc.  Do you need to provide all of 
them?  No, but it's nice of the server to at least provide NT or Turtle and 
XML.  Ultimately it's up to the server.  But the only difference between use 
cases #2 and #3 is whether the output is wrapped in an array, so it's probably 
easy for the server to produce both.

Depending on how much time I get next week I'll talk with the developer network 
folks to see what I need to do to put a specification under their 
infrastructure.  Looks like from my schedule it's going to be another week of 
hell :(


Andy.
   


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 3:45 PM, Bill Dueber wrote:

On Fri, Mar 5, 2010 at 3:14 PM, Houghton,Andrew  wrote:


   

As you point out JSON streaming doesn't work with all clients and I am
hesitent to build on anything that all clients cannot accept.  I think part
of the issue here is proper API design.  Sending tens of megabytes back to a
client and expecting them to process it seems like a poor API design
regardless of whether they can stream it or not.  It might make more sense
to have a server API send back 10 of our MARC-JSON records in a JSON
collection and have the client request an additional batch of records for
the result set.  In addition, if I remember correctly, JSON streaming or
other streaming methods keep the connection to the server open which is not
a good thing to do to maintain server throughput.

 

I guess my concern here is that the specification, as you're describing it,
is closing off potential uses.  It seems fine if, for example, your primary
concern is javascript-in-the-browser, and browser-request,
pagination-enabled systems might be all you're worried about right now.

That's not the whole universe of uses, though. People are going to want to
dump these things into a file to read later -- no possibility for pagination
in that situation. Others may, in fact, want to stream a few thousand
records down the pipe at once, but without a streaming parser that can't
happen if it's all one big array.

I worry that as specified, the *only* use will be, "Pull these down a thin
pipe, and if you want to keep them for later, or want a bunch of them, you
have to deal with marc-xml." Part of my incentive is to *not* have to use
marc-xml, but in this case I'd just be trading one technology I don't like
(marc-xml) for two technologies, one of which I don't like (that'd be
marc-xml again).

I really do understand the desire to make this parallel to marc-xml, but
there's a seam between the two technologies that makes that a problematic
approach.
   
For my part, I'd like to explore the options of putting MARC data into 
CouchDB (which stores documents as JSON), which could then open the door 
to replicating that data between any number of CouchDB installations as 
well as providing for various output formats (marc-xml, etc.).
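
Replication itself is just an HTTP call to CouchDB; a rough sketch 
(the database names here are made up):

POST /_replicate HTTP/1.1
Content-Type: application/json

{"source": "marc_records",
 "target": "http://couch.example.org:5984/marc_records"}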


It's just an idea, but it's one that uses JSON outside of the browser 
and is a good proof case for any MARC in JSON format.


Thanks,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 2:46 PM, Ross Singer wrote:

On Fri, Mar 5, 2010 at 2:06 PM, Benjamin Young  wrote:

   

A CouchDB friend of mine just pointed me to the BibJSON format by the
Bibliographic Knowledge Network:
http://www.bibkn.org/bibjson/index.html

Might be worth looking through for future collaboration/transformation
options.
 

marc-json and BibJSON serve two different purposes:  marc-json would
need to be a loss-less serialization of a MARC record which may or may
not contain bibliographic data (it may be an authority, holding or CID
record, for example).  BibJSON is more of a merging of data model and
serialization (which, admittedly, is no stranger to MARC) for the
purpose of bibliographic /citations/.  So it will probably be lossy
and there would most likely be a lot of MARC data that is out of
scope.

That's not to say it wouldn't be useful to figure out how to get from
MARC->BibJSON, but from my perspective it's difficult to see the
advantage it brings (being tied to JSON) vs. BIBO.

-Ross.
   
Thanks for the clarification, Ross. I thought it would be helpful (if 
nothing else) to see how data was being mapped in a related domain into 
and out of JSON. I'm new to library data in general, so I appreciate the 
clarification on which format is for what.


Appreciated,
Benjamin


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 1:10 PM, Houghton,Andrew wrote:

From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Bill Dueber
Sent: Friday, March 05, 2010 12:30 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Q: XML2JSON converter

On Fri, Mar 5, 2010 at 12:01 PM, Houghton,Andrew wrote:

Too bad I didn't attend code4lib.  OCLC Research has created a version of 
MARC in JSON and will probably release FAST concepts in MARC binary, 
MARC-XML and our MARC-JSON format among other formats.  I'm wondering 
whether there is some consensus that can be reached and standardized at 
LC's level, just like OCLC, RLG and LC came to consensus on MARC-XML. 
Unfortunately, I have not had the time to document the format, although it 
is fairly straightforward, and yes we have an XSLT to convert from 
MARC-XML to MARC-JSON.  Basically the format I'm using is:

The stuff I've been doing:

   http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

... is pretty much the same, except:
 

I decided to stick closer to a MARC-XML-type definition since it would be 
easier to explain how the two specifications are related, rather than take a 
more radical approach and produce a less familiar specification.  Not to say 
that other approaches are bad, they just have different advantages and 
disadvantages.  I was going for simple and familiar.

I certainly would be willing to work with LC on creating a MARC-JSON 
specification, as I did in creating the MARC-XML specification.


Andy.
   
A CouchDB friend of mine just pointed me to the BibJSON format by the 
Bibliographic Knowledge Network:

http://www.bibkn.org/bibjson/index.html

Might be worth looking through for future collaboration/transformation 
options.


Benjamin


Re: [CODE4LIB] Q: XML2JSON converter

2010-03-05 Thread Benjamin Young

On 3/5/10 8:15 AM, Godmar Back wrote:

On Fri, Mar 5, 2010 at 3:59 AM, Ulrich Schaefer wrote:

   

Hi,
try this: http://code.google.com/p/xml2json-xslt/


 

I should have mentioned that I already tried everything I could find after
googling - this stylesheet doesn't meet the requirements, not by far. It
drops attributes just like simplexml_json does.

The one thing I didn't try is a program called 'BadgerFish.php' which I
couldn't locate - Google once indexed it at badgerfish.ning.com

  - Godmar
   

Godmar,

I'd be interested in collaborating with you on creating one. I'd bounced 
this question off the CouchDB IRC channel a while back, and the summary 
was that you'd generally create a JSON structure for your document and 
then write the code to map the XML to JSON. However, I do think 
something more "generic" like Google's GData to JSON would fit the bill 
for most use cases...sadly, it doesn't seem they've made the conversion 
code available.


If you're looking at putting MARC into JSON, there was some discussion 
of that during code4lib 2010. Jonathan Rochkind, who was at code4lib 
2010, blogged about marc-json recently:

http://bibwild.wordpress.com/2010/03/03/marc-json/
He references a project that Bill Dueber's been playing with for a year:
http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/

All told, there's growing momentum for a MARC in JSON format to be 
created, so you might jump in there.


Additionally, I'd love to find a project building code to do what 
Google's done with the GData to JSON format. If you find one, I'd enjoy 
seeing it.


Thanks, Godmar,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung


Re: [CODE4LIB] faceted browsing

2010-02-08 Thread Benjamin Young
Have you seen the Exhibit library (part of the Simile project at MIT)? 
It provides faceted browsing along with map integration:

http://www.simile-widgets.org/exhibit/

It should be fairly easy to add to an existing project as it can consume 
a pretty simple JSON format that your app could provide.


Since you're familiar with CakePHP, it would be very easy to turn 
parseExtensions on in your routes.php file and provide specific views 
for ".json" requests (they'd be stored in views/audio/json/index.ctp for 
instance).


The Exhibit JSON format is based on some RDF concepts I believe, so if 
you're into that at all, it will be doubly enjoyable. :)
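
For a sense of the format, here's a minimal sketch of the JSON Exhibit 
consumes (the item and its property names are invented; Exhibit will 
facet on whatever properties you supply):

{
  "items": [
    {
      "type": "Audio",
      "label": "Field Recording No. 1",
      "creator": "Jane Smith",
      "year": "2009"
    }
  ]
}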


Hope that helps,
Benjamin

On 2/8/10 1:31 PM, Ethan Gruber wrote:

I just checked up on CollectiveAccess' features, and the newest version has
faceting search/browse now, so you may want to try that.  They support
uploading videos as well.  http://www.collectiveaccess.org/about/overview

Ethan

On Mon, Feb 8, 2010 at 1:25 PM, Ethan Gruber  wrote:

   

I think Omeka may be a good fit for you, but there currently isn't faceted
searching, though a Solr plugin is currently in development.  You have a
very specific set of requirements, so I'm not sure that any single CMS/DAM
will work in precisely the way you want right out of the box, but Omeka
could very well be the closest thing.  It's written in the Zend framework
for PHP.  I know that there is great demand for a Solr plugin for Omeka.
It's in the Omeka svn repo, but it's not really ready yet for primetime.

Ethan Gruber
University of Virginia Library


On Mon, Feb 8, 2010 at 11:58 AM, Earles, Jill Denaewrote:

 

I would like recommendations for faceted browsing systems that include
authentication, and easily support multimedia content and metadata.  The
ability to add comments and tags to content, and browse by tag cloud is
also desirable.

My skills include ColdFusion, PHP, CakePHP, and XML/XSL.  The only
system I've worked with that includes faceted browsing is XTF, and I
don't think it's well suited to this.  I am willing to learn a new
language/technology if there is a system that includes most of what I'm
looking for.

Please let me know of any open-source systems you know of that might be
suited to this.  If you have time and interest, see the detailed
description of the system below.

Thank you,
Jill Earles

Detailed description:

I am planning to build a system to manage a collection of multimedia
artwork, to include audio, video, images, and text along with
accompanying metadata.  The system should allow for uploading the
content and entering metadata, and discovery of content via searching
and faceted browsing.  Ideally it will also include a couple of ways of
visually representing the relationships between items (for example, a
video and the images and audio files that are included in the video, and
notes about the creative process).  The views we've conceived of at this
point include a "flow" view that shows relationships with arrows between
them (showing chronology or this begat that relationship), and a
"constellation" view that shows all of the related items, with or
without lines between them.

It needs to have security built in so that only contributing members can
search and browse the contributions by default.  Ideally, there would be
an approval process so that a contributor could propose making a work
public, and if all contributors involved in the work (including any
components of the work, i.e. the images and audio files included in the
video) give their approval, the work would be made public.  The public
site would also have faceted browsing, searching by all metadata that we
make public, and possibly tag clouds, and the ability to add tags and
comments about the work.

   


 


Re: [CODE4LIB] Q: what is the best open source native XML database

2010-01-18 Thread Benjamin Young

Hey Godmar,

I'd definitely consider CouchDB, as Patrick mentioned. It's a 
"schema-free" JSON document database, and replication is its greatest 
strength.

It does have Lucene integration:
http://github.com/rnewson/couchdb-lucene
Paul J. Davis of the core CouchDB team has a nice write-up:
http://www.davispj.com/2009/01/18/couchdb-lucene-indexing.html

There's also some Solr integration available:
http://github.com/deguzman/couchdb-solr2

From what you've described, CouchDB would be a great choice for your 
application.


Hope that's helpful, Godmar,
Benjamin

--
President
BigBlueHat
P: 864.232.9553
W: http://www.bigbluehat.com/
http://www.linkedin.com/in/benjaminyoung 




Godmar Back wrote:

Hi,

we're currently looking for an XML database to store a variety of
small-to-medium sized XML documents. The XML documents are
unstructured in the sense that they do not follow a schema or DTD, and
that their structure will be changing over time. We'll need to do
efficient searching based on elements, attributes, and full text
within text content. More importantly, the documents are mutable.
We'll like to bring documents or fragments into memory in a DOM
representation, manipulate them, then put them back into the database.
Ideally, this should be done in a transaction-like manner. We need to
efficiently serve document fragments over HTTP, ideally in a manner
that allows for scaling through replication. We would prefer strong
support for Java integration, but it's not a must.

Have others encountered similar problems, and what have you been using?

So far, we're researching: eXist-DB (http://exist.sourceforge.net/ ),
Base-X (http://www.basex.org/ ), MonetDB/XQuery
(http://www.monetdb.nl/XQuery/ ), Sedna
(http://modis.ispras.ru/sedna/index.html ). Wikipedia lists a few
others here: http://en.wikipedia.org/wiki/XML_database
I'm wondering to what extent systems such as Lucene, or even digital
object repositories such as Fedora could be coaxed into this usage
scenario.

Thanks for any insight you have or experience you can share.

 - Godmar