On Mon, 2009-12-14 at 09:42 -0500, Mike Rylander wrote:
> On Mon, Dec 14, 2009 at 12:07 AM, Dan Scott <d...@coffeecode.net> wrote:
> > On Sun, 2009-12-13 at 23:36 -0500, Warren Layton wrote:
> >> On Sun, Dec 13, 2009 at 8:50 PM, Dan Scott <d...@coffeecode.net> wrote:
> >> > That issue notwithstanding, I would be in favour of applying this patch
> >> > to trunk at this time, and with a little more testing and confirmation
> >> > of the fingerprinting goals, I would like to see it backported to the
> >> > 1.6 series.
> >>
> >> Thanks for testing this patch, Dan, and suggesting it for trunk.
> >>
> >> (And I, too, am curious about the goals of fingerprinting and whether
> >> non-ASCII is acceptable.)
> >
> > In my opinion, it has to be acceptable if we want to support metarecord
> > grouping for Armenian and Czech and Russian and Nepalese - all languages
> > for which we've either had some translations contributed, or had people
> > working on getting Evergreen running (or both).
> >
> 
> The purpose of removing characters outside the ASCII range (well,
> actually, the original design was for removing non-spacing marks in
> NFD characters, but that seems impossible in JS) is to thunk to the
> lowest common denominator -- think Chávez vs Chavez, the like of which
> is extremely common in public library catalogs, especially when
> merging records from institutions with different cataloging standards.

Right, getting to plain ASCII fingerprints cleanly (i.e. not dropping
characters or strings entirely) where possible makes sense in that
context, and I think it's a reasonable default for Evergreen.

> Since we're not aware of anyone actually making use of the mutability
> of the fingerprinter (well, beyond me), I don't have too strong of an
> argument against reimplementing in Perl.  However, in order to make it
> nominally possible to retain the functionality, I do feel pretty
> strongly that the main body of the fingerprinting and weighting
> "quality" logic should be segregated into its own file.

Any opinions on where/how this file should be located? Should it just be
a separate Perl module that defines the appropriate subroutines, which
then get called by Ingest.pm - something like
OpenILS::Application::Ingest::English.pm - and then we could provide a
sample non-English configuration file / module that could be swapped in
for less Anglo-centric sites? Or perhaps language maintainers could
maintain language-specific versions, or (more likely) versions grouped
by common requirements.

> As for the default algorithm, I think removal of non-spacing combining
> marks is pretty important.  Replacing the tr/// with lc() and adding
> s/\p{M}+//g will take care of that.

That's okay, as long as it's a configurable option. Icelandic apparently
treats such characters quite differently: o, ó, and ö are entirely
different letters and shouldn't be folded together. Of course, I
haven't heard anything from any libraries in Iceland yet about adopting
Evergreen, so that's an academic concern :)
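For reference, the fold Mike describes can be sketched in Python (a
sketch only - the function name is mine, and Evergreen's real
fingerprinter is the Perl/JS code under discussion). One detail worth
noting: stripping \p{M} only does what we want after NFD decomposition,
since a precomposed á carries its accent in a single codepoint:

```python
import unicodedata

def fold_fingerprint(text: str) -> str:
    """Lowercase, then strip combining marks so e.g. "Chávez" and
    "Chavez" produce the same fingerprint text. Illustrative only."""
    # NFD decomposition: precomposed "á" becomes "a" + combining acute,
    # which is what makes the mark-stripping step effective.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop anything in the Unicode M* categories (Perl's \p{M}).
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )

print(fold_fingerprint("Chávez"))  # -> "chavez"
print(fold_fingerprint("Chávez") == fold_fingerprint("Chavez"))  # -> True
```

An Icelandic-aware configuration would presumably skip the mark-stripping
step (or exempt specific letters) rather than fold ó and ö into o.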

> The quality metric includes a bump for language so that records in the
> primary language of the catalog will end up (more often than not)
> being used as the lead record in metarecords -- without the bump,
> non-primary-language records would have an advantage simply because
> they have more tags (Romanizations) and that would be suboptimal for
> patrons.  So that, too, is pretty important, and one of the original
> design reasons for the JS implementation.  Leaving English as the
> default seems sane to me, as most adoption of Evergreen is still in
> primarily-English-speaking countries, and for Armenian, Czech,
> Russian, Nepalese and French-Canadian ;) catalogs, that quality
> adjustment can be removed or adjusted appropriately -- local
> modification being a main driver in the current JS implementation.
> 

Fair enough. 

Thanks a ton for jumping in here, Mike - it really helps shed light on
the background and intentions for the fingerprinting approach!
