Re: [CODE4LIB] marc records sample set

Jonathan Rochkind Fri, 09 May 2008 10:41:48 -0700

I think you start with a smaller set, but then when you find
idiosyncratic records that were NOT represented in your smaller set, you
add representative samples to the sample set. The sample set organically
grows.


Certainly at some point you've got to test on a larger set too. But I
think there's a lot of value in having a small test set as well. Of course,
it is something of a challenge to even come up with a reasonably
representative small set. But it doesn't need to be absolutely
representative---when you find examples not represented, you add them.
It grows.
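
To make the "it grows" idea concrete, here is a minimal sketch, not anyone's actual tooling. It assumes the sample set is kept as raw MARC (ISO 2709) bytes, where each record ends with the 0x1D record terminator, and that an exact byte-for-byte match counts as "already represented". The names split_records and add_to_sample are hypothetical, made up for this illustration.

```python
from __future__ import annotations

import hashlib

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def split_records(data: bytes) -> list[bytes]:
    """Split a raw MARC byte stream into individual records.

    Each returned record keeps its trailing terminator so it can be
    written back out unchanged.
    """
    return [
        chunk + RECORD_TERMINATOR
        for chunk in data.split(RECORD_TERMINATOR)
        if chunk.strip()  # skip the empty tail after the last terminator
    ]


def add_to_sample(sample: dict[str, bytes], record: bytes) -> bool:
    """Add a record to the sample set unless an identical one is present.

    Assumption: identity is a hash of the raw bytes. Returns True if
    the sample set grew.
    """
    key = hashlib.sha1(record).hexdigest()
    if key in sample:
        return False
    sample[key] = record
    return True
```

In practice you would call add_to_sample whenever indexing the full catalog turns up a record that misbehaves, then re-run the small set as a regression check.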

Jonathan

Kyle Banerjee wrote:

According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop...

5. contain a distribution of typical errors one might encounter with
marc records in the wild


This is much harder to do than might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think such data sizes are useful for testing interfaces,
but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that completeness and quality are highly variable. When we were
experimenting some time back, we found that how you normalize the data
and how you weight terms as well as documents has an enormous impact
on search results and that unless you do some tuning, you will
inevitably find a lot of garbage too close to the top with a bunch of
good stuff ranked so low it isn't found.
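
A toy sketch of the weighting effect, purely illustrative: it assumes a naive term-count scorer with per-field weights, which is not what any particular search engine does. The records, field names, and the score and rank functions are all invented for the example.

```python
def score(doc, query_terms, weights):
    """Sum per-field term counts, each multiplied by that field's weight."""
    total = 0.0
    for field, text in doc.items():
        words = text.lower().split()
        hits = sum(words.count(term) for term in query_terms)
        total += weights.get(field, 0.0) * hits
    return total


def rank(docs, query_terms, weights):
    """Return record names ordered from highest to lowest score."""
    return sorted(
        docs,
        key=lambda name: score(docs[name], query_terms, weights),
        reverse=True,
    )


# Hypothetical records: one brief but on target, one wordy and tangential.
docs = {
    "sparse": {"title": "Hamlet", "notes": ""},
    "verbose": {
        "title": "Essays on drama",
        "notes": "hamlet hamlet hamlet criticism",
    },
}

flat = rank(docs, ["hamlet"], {"title": 1.0, "notes": 1.0})
boosted = rank(docs, ["hamlet"], {"title": 5.0, "notes": 1.0})
```

With flat weights the wordy record outranks the brief record whose title is the query; boosting the title field reverses the order. Two records are enough to show the mechanism, but not to choose good weights.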

kyle


--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
