Hi Fernando,

I have started experimenting with MARC in MongoDB. I have imported ~6 million
MARC records (authority and bibliographic) into MongoDB. The steps I took:

1) the source was MARCXML, which I created with the XC OAI Toolkit.
2) I created an XSLT file which creates MARC-JSON from MARCXML.
I followed the MARC-JSON draft
(http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11) and not
Bill Dueber's MARC HASH. The conversion is not 100% perfect, but of the
6 million records only 20 were converted with errors, which is an
acceptable error rate for a home-made project.
3) imported the files
4) indexed the files

Lessons learned:
- the import process is much quicker than any other part of the workflow.
The 6 million records were imported in about 30 minutes, while indexing took
3 hours.
- count() is a very slow method for complex queries, even after intensive
indexing, but iterating over the results is quicker.
- there is no way to index parts of strings (e.g. splitting the leader or
006/007/008).
- full text search is not very quick.
- before indexing the size of the database was 9 GB, after full indexing it
was 28 GB (I should note that on a 32-bit operating system MongoDB can
handle only about 2 GB of data in total, indexes included).

- the MARC-JSON format is good for data exchange, but it is not precise
enough for searching, since (a MARC heritage) distinct pieces of information
are combined into single fields (Leader, 008 etc.). We should split them
into smaller chunks of information before indexing.
- I should learn more about the possibilities of MongoDB
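
As a minimal sketch of the splitting mentioned above, the leader could be
cut into named parts before import. The function name splitLeader and the
property names are hypothetical; the positions are those of the MARC 21
bibliographic leader, and only a few of them are shown.

```javascript
// Hypothetical helper: split a 24-character MARC 21 bibliographic leader
// into named parts, so that each part gets its own key (and its own path).
function splitLeader(leader) {
  if (typeof leader !== "string" || leader.length !== 24) {
    throw new Error("a MARC leader must be a 24-character string");
  }
  return {
    recordStatus: leader.charAt(5),        // leader position 05
    typeOfRecord: leader.charAt(6),        // leader position 06
    bibliographicLevel: leader.charAt(7),  // leader position 07
    encodingLevel: leader.charAt(17),      // leader position 17
    catalogingForm: leader.charAt(18)      // leader position 18
  };
}

// Example leader, made up for illustration.
var leaderParts = splitLeader("00714cam a2200205 a 4500");
```

The resulting object could be stored next to the original leader string, so
that the record remains exchangeable as MARC-JSON while the parts can be
indexed one by one.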

I can give you more technical details, if you are interested.

eXtensible Catalog

----- Original Message -----
From: "Fernando Gómez" <fjgo...@gmail.com>
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

There's been some talk in code4lib about using MongoDB to store MARC
records in some kind of JSON format. I'd like to know if you have
experimented with indexing those documents in MongoDB. From my limited
exposure to MongoDB, it seems difficult, unless MongoDB supports some
kind of "custom indexing" functionality.

According to the MongoDB docs [1], "you can create an index by calling
the ensureIndex() function, and providing a document that specifies
one or more keys to index." Examples of this are:

   db.things.ensureIndex({"city": 1})
   db.things.ensureIndex({"address.city": 1})

That is, you specify the keys giving a path from the root of the
document to the data element you are interested in. Such a path acts
both as the index's name and as a specification of how to get the
keys' values.

In the case of the two proposed MARC-JSON formats [2, 3], I can't see such
a "path". For example, say you want an index on field 001. Simplifying,
the JSON docs would look like this

   { "fields" : [ ["001", "001 value"], ... ] }

or this

   { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB?

It would be nice to have some kind of custom indexing, where one could
provide an index name and separately a JavaScript function specifying
how to obtain the keys' values for that index.
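
One way to approximate this without custom index functions might be to
compute the keys before saving the document, so that an ordinary path
exists. A rough sketch follows, in which the helper addIndexKeys, the idx
subdocument and the collection name records are all assumptions and not part
of either MARC-JSON proposal:

```javascript
// Hypothetical helper: copy control field values into a flat "idx"
// subdocument keyed by tag. Values are collected in arrays because some
// control fields (e.g. 006, 007) may occur more than once in a record.
function addIndexKeys(record) {
  var idx = {};
  (record.controlfield || []).forEach(function (field) {
    if (!idx[field.tag]) {
      idx[field.tag] = [];
    }
    idx[field.tag].push(field.data);
  });
  record.idx = idx;
  return record;
}

var doc = addIndexKeys({
  controlfield: [{ tag: "001", data: "fst01312614" }]
});

// In the mongo shell the document could then be stored and indexed by
// path (the collection name "records" is an assumption):
//   db.records.insert(doc)
//   db.records.ensureIndex({"idx.001": 1})
//   db.records.find({"idx.001": "fst01312614"})
```

The price of this approach is that the key values are stored twice, once in
the MARC-JSON structure and once under idx.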

Any suggestions? Do other document oriented databases offer a better
solution for this?

BTW, I fed MongoDB with the example MARC records in [2] and [3], and
it choked on them. Both are missing some commas :-)

[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11

Fernando Gómez
Biblioteca "Antonio Monteiro"
INMABB (Conicet / Universidad Nacional del Sur)
Av. Alem 1253
B8000CPB Bahía Blanca, Argentina
Tel. +54 (291) 459 5116
