I want to put this out there for discussion/inspection, as this is my first time touching the SIP code.
The basic problem is that we purchased a brand-spanking-new 3M self-check unit just before Christmas, plugged it in, and were happy to see it working with the OpenNCIP / Evergreen SIP server right out of the box. That was the case, at least, until we ran some items through that had accented characters in the title field (e.g. "Présentation et synthèse d'une évaluation romande") - where we found the self-check threw an error about "bad checksum - too many retries" and turned itself off. Digging into the code, I found a few things: OpenILS::SIP::Item.pm contained a line that tried to remove accents from characters before returning the title; however, it assumed that the data was in Normalization Form D (NFD) format, and our data is in Normalization Form C (NFC) format. The simple fix to that was just to wrap the text in a Unicode::Normalization::NFD() call and things were okay, at least for text that could be represented as ASCII characters with accents. There were many other text fields that could be returned that would possibly contain non-ASCII characters - the user's own name, user address, library names, etc - and these were not being escaped the same way as the book titles. It bugged me more, however, that our self-check unit offered UTF8 as an communication encoding option in the SIP configuration interface, yet the Evergreen code was trying to dumb everything down to plain ASCII. In the self-check unit trace logs, I could see the Unicode data coming through, but the checksum was bad. So I dug into the OpenNCIP code and found that it calculates the checksum using Unicode characters (%U in the unpack() format) rather than bytes (%C). Trying to travel back in time and jumping into the minds of the original developers of the 3M spec, I guessed that they probably didn't anticipate multi-byte characters (hey, it was all the way back in 1997, according to atz) and tried changing the checksum algorithm implementation to just calculate the checksum on a byte-wise basis instead. In addition, I opted to try to treat all of the text as UTF8 by explicitly decoding & encoding it as UTF8, rather than trying to just strip combining characters from the data. (I did a lot of reading on Perl + Unicode over the holidays, good holiday fun!) This combination made our self-check unit quite happy. It prints out receipts with Unicode character data and doesn't complain about bad checksums. So this code is already live at Laurentian University, for what that's worth. Attached, please find a patch for the OpenILS::SIP::* code that defaults to enabling UTF8 data in a new subroutine clean_text() in OpenILS::SIP, but by uncommenting two lines will revert to the previous "strip combining characters" approach (with the addition of the NFD() call). I have tried to apply the clean_text() call to all of the areas where one would possibly pass Unicode data over the wire. Comments, suggestions, and enhancements are quite welcome. For example, one could conceivably define a target encoding in oils_sip.xml (e.g. "<encoding>iso-8859-1</encoding>") and have clean_text() invoke encode($target_encoding, $text) rather than assuming UTF-8, as I imagine there is a need to support older self-check units. An alternate implementation of the NFD + strip combining characters approach for really old SIP clients and gnarly data could be something like encode('ascii', $text). I opened a tracker artifact for the checksum algorithm in the OpenNCIP project at https://sourceforge.net/support/tracker.php?aid=2925760 with a patch and the gory details about the checksums and logs. Some good Perl/Unicode reading, if you're interested in digging into this subject: 1. http://juerd.nl/site.plp/perluniadvice 2. http://perldoc.perl.org/perlunitut.html 3. http://perldoc.perl.org/perlunicode.html 4. http://perldoc.perl.org/Unicode/Normalize.html 5. http://perldoc.perl.org/Encode.html
=== modified file 'Open-ILS/src/perlmods/OpenILS/SIP.pm' --- Open-ILS/src/perlmods/OpenILS/SIP.pm 2008-10-28 13:27:57 +0000 +++ Open-ILS/src/perlmods/OpenILS/SIP.pm 2010-01-04 04:46:47 +0000 @@ -20,6 +20,8 @@ use OpenILS::Application::AppUtils; use OpenSRF::Utils qw/:datetime/; use DateTime::Format::ISO8601; +use Encode; +use Unicode::Normalize; my $U = 'OpenILS::Application::AppUtils'; my $editor; @@ -104,6 +106,52 @@ return $e; } +=head2 clean_text(scalar) + +Evergreen uses the UTF8 encoding for everything from the database up. Perl +doesn't know this, however, so we have to convince it to treat our UTF8 strings +as UTF8 strings. This may enable OpenNCIP to correctly calculate the checksums +for UTF8 text for SIP clients that support such modern options. + +The default implementation preserves all Unicode characters and simply ensures +that they are recognized by Perl as using the UTF8 encoding so that Unicode +semantics are employed. + +=over + +=item * Make this a configurable option, rather than force hardcoded +changes to the Perl module to switch to the diacritic-stripping +mode + +=back + +=cut + +sub clean_text { + my $text = shift || ''; + + $text = decode_utf8($text); + +=pod + +The alternate implementation attempts to strip diacritics from the text, +assuming that the SIP client really wants just plain ASCII. This may be true +for older SIP clients. + +Note that a limitation of the diacritic-stripping method is that it doesn't +handle other scripts (such as Chinese) at all; expect it to fail miserably. + +=cut + +# Uncomment these lines to strip diacritics from the text +# $text = NFD($text); +# $text =~ s/\pM+//og; + + $text = encode_utf8($text); + + return $text; +} + sub format_date { my $class = shift; my $date = shift; === modified file 'Open-ILS/src/perlmods/OpenILS/SIP/Item.pm' --- Open-ILS/src/perlmods/OpenILS/SIP/Item.pm 2009-01-26 19:04:51 +0000 +++ Open-ILS/src/perlmods/OpenILS/SIP/Item.pm 2010-01-04 17:32:28 +0000 @@ -147,18 +147,19 @@ sub title_id { my $self = shift; my $t = ($self->{mods}) ? $self->{mods}->title : $self->{copy}->dummy_title; - $t =~ s/\pM+//og; + $t = OpenILS::SIP::clean_text($t); + return $t; } sub permanent_location { my $self = shift; - return $self->{volume}->owning_lib->name; + return OpenILS::SIP::clean_text($self->{volume}->owning_lib->name); } sub current_location { my $self = shift; - return $self->{copy}->circ_lib->name; + return OpenILS::SIP::clean_text($self->{copy}->circ_lib->name); } @@ -214,7 +215,7 @@ sub owner { my $self = shift; - return $self->{volume}->owning_lib->name; + return OpenILS::SIP::clean_text($self->{volume}->owning_lib->name); } sub hold_queue { @@ -263,14 +264,14 @@ # message to display on console sub screen_msg { my $self = shift; - return $self->{screen_msg} || ''; + return OpenILS::SIP::clean_text($self->{screen_msg}) || ''; } # reciept printer sub print_line { my $self = shift; - return $self->{print_line} || ''; + return OpenILS::SIP::clean_text($self->{print_line}) || ''; } === modified file 'Open-ILS/src/perlmods/OpenILS/SIP/Patron.pm' --- Open-ILS/src/perlmods/OpenILS/SIP/Patron.pm 2009-12-11 15:01:00 +0000 +++ Open-ILS/src/perlmods/OpenILS/SIP/Patron.pm 2010-01-04 17:32:50 +0000 @@ -92,9 +92,9 @@ sub name { my $self = shift; - my $u = $self->{user}; - return $u->first_given_name . ' ' . - $u->second_given_name . ' ' . $u->family_name; + my $u = $self->{user}; + return OpenILS::SIP::clean_text($u->first_given_name . ' ' . + $u->second_given_name . ' ' . $u->family_name); } sub home_library { @@ -105,15 +105,15 @@ } sub __addr_string { - my $addr = shift; - return "" unless $addr; - return $addr->street1 .' '. - $addr->street2 .' '. - $addr->city .' '. - $addr->county .' '. - $addr->state .' '. - $addr->country .' '. - $addr->post_code; + my $addr = shift; + return "" unless $addr; + return OpenILS::SIP::clean_text($addr->street1 .' '. + $addr->street2 .' '. + $addr->city .' '. + $addr->county .' '. + $addr->state .' '. + $addr->country .' '. + $addr->post_code); } sub address { @@ -128,12 +128,12 @@ sub email_addr { my $self = shift; - return $self->{user}->email; + return OpenILS::SIP::clean_text($self->{user}->email); } sub home_phone { my $self = shift; - return $self->{user}->day_phone; + return $self->{user}->day_phone; } sub sip_birthdate { @@ -145,7 +145,7 @@ sub ptype { my $self = shift; - return $self->{user}->profile->name; + return OpenILS::SIP::clean_text($self->{user}->profile->name); } sub language { @@ -323,7 +323,7 @@ ); my @holds; - push( @holds, $self->__hold_to_title($_) ) for @$holds; + push( @holds, OpenILS::SIP::clean_text($self->__hold_to_title($_)) ) for @$holds; return (defined $start and defined $end) ? [ $holds[($start-1)..($end-1)] ] : @@ -426,7 +426,7 @@ if(@return_datatype and $return_datatype[0]->{value} eq 'barcode') { push( @o, __circ_to_barcode($self->{editor}, $circid)); } else { - push( @o, __circ_to_title($self->{editor}, $circid)); + push( @o, OpenILS::SIP::clean_text(__circ_to_title($self->{editor}, $circid))); } } @overdues = @o; @@ -473,7 +473,7 @@ if(@return_datatype and $return_datatype[0]->{value} eq 'barcode') { push( @c, __circ_to_barcode($self->{editor}, $circid)); } else { - push( @c, __circ_to_title($self->{editor}, $circid)); + push( @c, OpenILS::SIP::clean_text(__circ_to_title($self->{editor}, $circid))); } } @@ -525,7 +525,7 @@ # retrieve the un-fleshed user object for update $u = $e->retrieve_actor_user($u->id); - my $note = $u->alert_message || ""; + my $note = OpenILS::SIP::clean_text($u->alert_message) || ""; $note = "CARD BLOCKED BY SELF-CHECK MACHINE\n$note"; # XXX Config option $u->alert_message($note); @@ -569,7 +569,7 @@ my $e = OpenILS::SIP->editor(); $INET_PRIVS = $e->retrieve_all_config_net_access_level() unless $INET_PRIVS; my ($level) = grep { $_->id eq $self->{user}->net_access_level } @$INET_PRIVS; - my $name = $level->name; + my $name = OpenILS::SIP::clean_text($level->name); syslog('LOG_DEBUG', "OILS: Patron inet_privs = $name"); return $name; }