Re: [CODE4LIB] code4lib lucene pre-conference

Andrew Nagy Wed, 29 Nov 2006 08:17:03 -0800

Clay Redding wrote:

> Hi Andrew (or anyone else that cares to answer),
>
> I've missed out on hearing about incompatibilities between MARCXML and
> NXDBs.   Can you explain?  Is this just eXist and Sleepycat, or are
> there others?  I seem to recall putting a few records in X-Hive with no
> problems, but I didn't put it through any paces.


Yes, I have only done my testing with eXist and Sleepycat, but I also
have an implementation of MarkLogic that I would like to test out.  I
imagine, though, that all NXDBs will have the same problem.  This is the
heart of my proposed talk.  It has to do with the layout of MARCXML.
Adding a few records to any NXDB will work like a charm; do your testing
with 250,000+ records and then you will begin to see the true spirit of
your NXDB.

> Also, if there was a cure to the problems with MARCXML (I'm sure we can
> all think of some), what would you suggest to help alleviate the
> problems?


Sure, I know of a cure!  I have come up with a modified MARCXML schema,
but as I am investigating Solr further, I think the Solr schema is also
a cure.

The problem with MARCXML is that all of the elements have the same name
and use attributes to differentiate them (excuse me while I barf).  This
makes indexing at the XML level very difficult, especially for NXDBs.
The main developers of both packages (eXist, Berkeley DB) agreed with me
on this front.  My schema just puts each MARC field into its own
element.  Instead of <datafield tag="245">, I created an element called
<T245>, and instead of spreading the subfields over multiple tags, I
just put all of the subfields into one element.  No one needs to search
(from my perspective) the subtitle ("b") separately from the main ("a")
title, so I just made a really simple XML document that is 1/4 the size.
By doing this I was able to take a 45-minute search of MARCXML records
and reduce it down to results in 1 second.  The main boost was not the
reduction in file size, but the way the indexing works.
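
Here is a minimal sketch, in Python, of the kind of flattening described
above.  The actual schema is not reproduced in this message, so treat the
details as assumptions for illustration: the "T" prefix applied to every
field (control fields included), subfields joined with single spaces, the
plain <record> root, the flatten_record name, and the made-up sample
record.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard MARCXML ("MARC21 slim") records.
MARC_NS = "http://www.loc.gov/MARC21/slim"

# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nam a2200000 a 4500</leader>
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Moby Dick :</subfield>
    <subfield code="b">or, The whale /</subfield>
  </datafield>
</record>"""


def flatten_record(marc_record):
    """Return a flat <record> with one T<tag> element per MARC field.

    Assumption: subfields are concatenated with spaces, and indicators
    and subfield codes are dropped.
    """
    flat = ET.Element("record")
    for field in marc_record:
        local_name = field.tag.rsplit("}", 1)[-1]  # strip the namespace
        tag = field.get("tag")
        if tag is None:
            continue  # e.g. the leader, which has no tag attribute
        if local_name == "controlfield":
            text = (field.text or "").strip()
        elif local_name == "datafield":
            subfields = field.findall(f"{{{MARC_NS}}}subfield")
            text = " ".join((sf.text or "").strip() for sf in subfields)
        else:
            continue
        ET.SubElement(flat, "T" + tag).text = text
    return flat


flat = flatten_record(ET.fromstring(SAMPLE))
print(ET.tostring(flat, encoding="unicode"))
```

With the flat form, a title lookup becomes a plain element-name path
(//T245) rather than an attribute predicate (//datafield[@tag='245']),
which is the difference the indexing point above rests on.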

Give it a shot, I promise better results!

Andrew
