Re: [CODE4LIB] code4lib lucene pre-conference

Kevin S. Clarke Wed, 29 Nov 2006 08:51:28 -0800

I think this is a data structure problem... MARC is well structured
for compact transmission (or was at one point) but not so much for
data (re)use (in my opinion).


One solution, as Erik has suggested, is to parse the data and build
intelligible indices.  Another, as Andrew suggests (and which I think
Endeca does too at least as a preliminary step), is to map to a more
reasonably arranged data structure (in XML) and index that.

Fwiw Andrew, I'd suggest you are not seeing the "true spirit of your
NXDB."  Try to put MARC into a RDBMS and you are going to run into the
same problem.  You have to index intelligently or reorganize the data
(which is the default when you put XML into a RDBMS anyway).  Perhaps
a criticism of NXDBs could be that they make it sound like they can
handle anything you throw at them without regard for what that is...
"If it is XML, we can handle it."

Data can have a structure that makes it more accessible or less.  The
promise of XML as a storage format (rather than a transmission format,
which is its other purpose) is that you can work with data in its
native format (no deconstruction necessary).  However, there is
nothing about XML or NXDBs that makes one use a well-structured data
format.

Kevin

ps: I'm still reeling at the idea of Elsevier making Lucene indices
available... wow, neat idea.

On 11/29/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:

Clay Redding wrote:

> Hi Andrew (or anyone else that cares to answer),
>
> I've missed out on hearing about incompatibilities between MARCXML and
> NXDBs.   Can you explain?  Is this just eXist and Sleepycat, or are
> there others?  I seem to recall putting a few records in X-Hive with no
> problems, but I didn't put it through any paces.

Yes, I have only done my testing with eXist and Sleepycat, but I also
have an implementation of MarkLogic that I would like to test out.  I
imagine though that all NXDBs will have the same problem.  This is the
heart of my proposed talk.  It has to do with the layout of MARCXML.
Adding a few records to any NXDB will work like a charm; do your testing
with 250,000+ records and then you will begin to see the true spirit of
your NXDB.

> Also, if there was a cure to the problems with MARCXML (I'm sure we can
> all think of some), what would you suggest to help alleviate the
> problems?

Sure, I know of a cure!  I have come up with a modified MARCXML schema,
but as I am investigating Solr further, I think the Solr schema is also
a cure.

The problem with MARCXML is that all of the elements have the same
name and use attributes to differentiate them (excuse me while I
barf).  This makes indexing at the XML level very difficult,
especially for NXDBs.  The main developers of both packages (eXist,
Berkeley DB XML) agreed with me on this.  My schema just puts each
MARC field into its own element.  Instead of <datafield tag="245">,
I created an element called <T245>, and instead of keeping the
subfields in multiple tags, I just put all of the subfields into one
element.  No one needs to search (from my perspective) the subtitle
("b") separately from the main ("a") title, so I just made a really
simple XML document that is 1/4 the size.  By doing this I was able to
take a 45 minute search of MARCXML records and reduce it down to results
in 1 second.  The main boost was not the reduction in file size, but the
way the indexing works.
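
For anyone who wants to try the idea, here is a minimal sketch of that
kind of flattening in Python.  It is not the actual schema or code
described above: the T-prefixed element names follow the <T245> example,
but the function name, the bare <record> wrapper, the space-joined
subfields, and the sample record are all assumptions made up for
illustration.  Control fields and indicators are simply ignored.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML ("MARC21 slim") documents.
MARC_NS = "http://www.loc.gov/MARC21/slim"

# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Example title :</subfield>
    <subfield code="b">an example subtitle</subfield>
  </datafield>
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Example, Author.</subfield>
  </datafield>
</record>"""


def flatten_record(record):
    """Return a new <record> with one <Tnnn> element per MARC datafield.

    Hypothetical helper: all subfields of a datafield are joined with a
    single space into the text of one element named after the tag.
    """
    flat = ET.Element("record")
    for field in record.findall(f"{{{MARC_NS}}}datafield"):
        parts = [
            (sf.text or "").strip()
            for sf in field.findall(f"{{{MARC_NS}}}subfield")
        ]
        elem = ET.SubElement(flat, "T" + field.get("tag"))
        elem.text = " ".join(p for p in parts if p)
    return flat


flat = flatten_record(ET.fromstring(SAMPLE))
print(ET.tostring(flat, encoding="unicode"))
```

With the elements named this way, an index can be keyed on the element
name alone (T245, T100, ...) instead of on an attribute value of a
generic <datafield> element, which is the difference the paragraph
above attributes the speedup to.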

Give it a shot, I promise better results!

Andrew
