On Mon, 2009-12-14 at 09:42 -0500, Mike Rylander wrote:
> On Mon, Dec 14, 2009 at 12:07 AM, Dan Scott <d...@coffeecode.net> wrote:
> > On Sun, 2009-12-13 at 23:36 -0500, Warren Layton wrote:
> >> On Sun, Dec 13, 2009 at 8:50 PM, Dan Scott <d...@coffeecode.net> wrote:
> >> > That issue notwithstanding, I would be in favour of applying this
> >> > patch to trunk at this time, and with a little more testing and
> >> > confirmation of the fingerprinting goals, I would like to see it
> >> > backported to the 1.6 series.
> >>
> >> Thanks for testing this patch, Dan, and suggesting it for trunk.
> >>
> >> (And I, too, am curious about the goals of fingerprinting and whether
> >> non-ASCII is acceptable.)
> >
> > In my opinion, it has to be acceptable if we want to support metarecord
> > grouping for Armenian and Czech and Russian and Nepalese - all languages
> > for which we've either had some translations contributed, or which have
> > had people working on getting Evergreen running (or both).
>
> The purpose for removing characters outside the ASCII range (well,
> actually, the original design was for removing non-spacing marks in
> NFD characters, but that seems impossible in JS) is to thunk to the
> lowest common denominator -- think Chávez vs Chavez, the like of which
> is extremely common in public library catalogs, especially when
> merging records from institutions with different cataloging standards.
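For anyone following along, the mark-stripping Mike describes can be sketched in modern JavaScript (which, unlike the 2009-era engine that made this "impossible in JS", now has normalize() and Unicode property escapes). The function name is mine, not Evergreen's:

```javascript
// Sketch of the diacritic-folding step described above: decompose to NFD,
// strip combining marks, and lowercase, so "Chávez" and "Chavez" yield the
// same fingerprint input. Illustrative only -- not Evergreen's actual code.
function foldMarks(s) {
  return s
    .normalize('NFD')          // decompose: "á" -> "a" + U+0301
    .replace(/\p{M}+/gu, '')   // drop combining marks (Perl's s/\p{M}+//)
    .toLowerCase();            // Perl's lc()
}

console.log(foldMarks('Chávez'));                          // "chavez"
console.log(foldMarks('Chávez') === foldMarks('Chavez'));  // true
```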
Right, getting to plain ASCII fingerprints cleanly (e.g. not dropping
characters / strings entirely) where possible makes sense in that
context, and I think it's a reasonable default for Evergreen.

> Since we're not aware of anyone actually making use of the mutability
> of the fingerprinter (well, beyond me), I don't have too strong an
> argument against reimplementing it in Perl. However, in order to make
> it nominally possible to retain the functionality, I do feel pretty
> strongly that the main body of the fingerprinting and weighting
> "quality" logic should be segregated into its own file.

Any opinions on where/how this file would be located? Should it just be
a separate Perl module that defines the appropriate subroutines that
then get called by Ingest.pm - something like
OpenILS::Application::Ingest::English.pm - and then we could provide a
sample non-English configuration file / module that could be swapped in
for less Anglo-centric catalogs? Or perhaps language maintainers could
maintain language-specific versions, or (more likely) versions with
common requirements.

> As for the default algorithm, I think removal of non-spacing combining
> marks is pretty important. Replacing the tr/// with lc() and adding
> s/\p{M}+// will take care of that.

That's okay, as long as it's a configurable option. Icelandic
apparently treats such characters quite differently: o, ó, and ö are
entirely different characters and shouldn't be folded together. Of
course, I haven't heard anything yet from any libraries in Iceland
about adopting Evergreen, so that's an academic concern :)

> The quality metric includes a bump for language so that records in the
> primary language of the catalog will end up (more often than not)
> being used as the lead record in metarecords -- without the bump,
> non-primary-language records would have an advantage simply because
> they have more tags (Romanizations) and that would be suboptimal for
> patrons.
> So that, too, is pretty important, and one of the original
> design reasons for the JS implementation. Leaving English as the
> default seems sane to me, as most adoption of Evergreen is still in
> primarily-English-speaking countries, and for Armenian, Czech,
> Russian, Nepalese and French-Canadian ;) catalogs, that quality
> adjustment can be removed or adjusted appropriately -- local
> modification being a main driver in the current JS implementation.

Fair enough. Thanks a ton for jumping in here, Mike - it really helps
shed light on the background and intentions behind the fingerprinting
approach!
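P.S. For the archives, the language "bump" in the quality metric could be sketched roughly like this - all names and numbers below are made up for illustration, not Evergreen's actual weighting code:

```javascript
// Hypothetical sketch of a primary-language quality bump: records in the
// catalog's primary language get extra weight so they tend to be chosen
// as the metarecord lead. PRIMARY_LANG and LANG_BUMP are illustrative and
// would be the locally configurable bits.
const PRIMARY_LANG = 'eng';
const LANG_BUMP = 5;

function recordQuality(record) {
  // Base score grows with tag count, which is why tag-heavy Romanized
  // records would otherwise outrank primary-language ones.
  let quality = record.tagCount;
  if (record.lang === PRIMARY_LANG) {
    quality += LANG_BUMP; // remove or adjust for non-English catalogs
  }
  return quality;
}

// A sparser English record now outranks a tag-heavy Romanized one:
console.log(recordQuality({ tagCount: 12, lang: 'eng' })); // 17
console.log(recordQuality({ tagCount: 15, lang: 'rus' })); // 15
```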