Re: [CODE4LIB] marc4j 2.4 released

Jonathan Rochkind Mon, 20 Oct 2008 12:23:11 -0700

To me, "de-duplication" means throwing out some records as duplicates.Are we talking about that, or are we talking about what I call "work setgrouping" and others (erroneously in my opinion) call "FRBRization"?

If the latter, I don't think there is any mature open source softwarethat addresses that yet. Or for that matter, any proprietaryfor-purchase software that you could use as a component in your owntools. Various proprietary software includes a work set grouping featurein it's "black box" (AquaBrowser, Primo, I believe the VTLS ILS). But Idon't know of anything available to do it for you in your own tool.

I've been just starting to give some thought to how to accomplish this,and it's a bit of a tricky problem on several grounds, includingcomputationally (doing it in a way that performs efficiently). Onechoice is whether you group records at the indexing stage, or on-demandat the retrieval stage. Both have performance implications--we reallydon't want to slow down retrieval OR indexing. Usually if you have thechoice, you put the slow down at indexing since it only happens "once"in abstract theory. But in fact, with what we do, when indexing that'salready been optmized and does not have this feature can take hours oreven days with some of our corpuses, and when in fact we do re-indexfrom time to time (including 'incremental' addition to the index of newand changed records)---we really don't want to slow down indexing either.


Jonathan

Bess Sadler wrote:

Hi, Mike.

I don't know of any off-the-shelf software that does de-duplication ofthe kind you're describing, but it would be pretty useful. That wouldbe awesome if someone wanted to build something like that into marc4j.Has anyone published any good algorithms for de-duping? As Iunderstand it, if you have two records that are 100% identical exceptfor holdings information, that's pretty easy. It gets harder when onerecord is more complete than the other, and very hard when one recordhas even slightly different information than the other, to tellwhether they are the same record and decide whose information toprivilege. Are there any good de-duping guidelines out there? When alibrary contracts out the de-duping of their catalog, what kind ofspecific guidelines are they expected to provide? Anyone know?

I remember the open library folks were very interested in thisquestion. Any open library folks on this list? Did that effort tode-dupe all those contributed marc records ever go anywhere?


Bess

On Oct 20, 2008, at 1:12 PM, Michael Beccaria wrote:

Very cool! I noticed that a feature, MarcDirStreamReader, is capable of
iterating over all marc record files in a given directory. Does anyone
know of any de-duplicating efforts done with marc4j? For example,
libraries that have similar holdings would have their records merged
into one record with a location tag somewhere. I know places do it
(consortia etc.) but I haven't been able to find a good open program
that handles stuff like that.

Mike Beccaria
Systems Librarian
Head of Digital Initiatives
Paul Smith's College
518.327.6376
[EMAIL PROTECTED]

---
This message may contain confidential information and is intended only
for the individual named. If you are not the named addressee you should
not disseminate, distribute or copy this e-mail. Please notify the
sender immediately by e-mail if you have received this e-mail by mistake
and delete this e-mail from your system.

-----Original Message-----
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Bess Sadler
Sent: Monday, October 20, 2008 11:12 AM
To: [email protected]
Subject: [CODE4LIB] marc4j 2.4 released

Dear Code4Libbers,

I'm very pleased to announce that for the first time in almost two
years there has been a new release of marc4j. Release 2.4 is a minor
release in the sense that it shouldn't break any existing code, but
it's a major release in the sense that it represents an influx of new
people into the development of this project, and a significant
improvement in marc4j's ability to handle malformed or mis-encoded
marc records.

Release notes are here: http://marc4j.tigris.org/files/documents/
220/44060/changes.txt

And the project website, including download links, is here: http://
marc4j.tigris.org/

We've been using this new marc4j code in solrmarc since solrmarc
started, so if you're using Blacklight or VuFind, you're probably
using it already, just in an unreleased form.

Bravo to Bob Haschart, Wayne Graham, and Bas Peters for making these
improvements to marc4j and getting this release out the door.

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University

410.516.8886rochkind (at) jhu.edu

Re: [CODE4LIB] marc4j 2.4 released

Reply via email to