Excellent catch, David! Please file a bug for that, if you haven't already, the use of MARC 880 for CJK languages is quite common in US libraries.
Does Zebra handle 4 byte characters correctly? --Barton On Fri, Mar 24, 2017 at 2:08 AM, David Cook <dc...@prosentient.com.au> wrote: > Hi all, > > > > I ran into a problem this afternoon where MySQL was silently truncating my > UTF8 encoded string in biblio_metadata.metadata and biblio.notes. > Admittedly, the string had some funny looking data*, but it was all valid > UTF8. I had zero problems with XML::LibXML or MARC::Record. However, > whenever I would use DBI or the mysql command line client, I’d get a > truncated string**. This meant a completely invalid XML record which is > irretrievable from the database using Koha’s XML handling methods. > > > > However, I eventually found https://bugs.mysql.com/bug.php?id=53005 where > they state that the MySQL ‘utf8’ charset has a 3 byte limit ( > https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html). That > tends to be fine in most cases, but 𝔤 is 4 bytes (f0 9d 94 a4 > http://dev.networkerror.org/utf8/?start=120100&end=120200& > cols=4&show_uni_int=on&show_uni_hex=on&show_html_ent=on& > show_raw_hex=on&show_raw_bin=on). It was on this character that the > truncation happened. Apparently, there is a 4 byte UTF-8 charset called > utf8mb4: https://dev.mysql.com/doc/refman/5.7/en/charset-unicode- > utf8mb4.html. If you “SET NAMES = ‘utf8mb4’” for your client and set the > charset for your database column, that will allow you to correctly insert > the data. > > > > So the truncation doesn’t generate an error but it does generate a > warning, which is much more obvious in the mysql CLI client than when using > DBI: > > MariaDB [devkohadcook]> update biblio set abstract = 'We introduce a new > perspective and a generalization of spectral networks for 4d N=2" > style="position: relative;" tabindex="0" id="MathJax-Element-1-Frame" > class="MathJax">=2 theories of class S" style="position: relative;" > tabindex="0" id="MathJax-Element-2-Frame" class="MathJax"> associated > to Lie algebras g=An" style="position: relative;" tabindex="0" > id="MathJax-Element-3-Frame" class="MathJax">𝔤=An, Dn" > style="position: relative;" tabindex="0" id="MathJax-Element-4-Frame" > class="MathJax">Dn, E6" style="position: relative;" tabindex="0" > id="MathJax-Element-5-Frame" class="MathJax">E6, and E7" > style="position: relative;" tabindex="0" id="MathJax-Element-6-Frame" > class="MathJax">E7. Spectral networks directly compute the BPS spectra > of 2d theories on surface defects coupled to the 4d theories. A Lie > algebraic interpretation of these spectra emerges naturally from our > construction, leading to a new description of 2d-4d wall-crossing > phenomena. Our construction also provides an efficient framework for the > study of BPS spectra of the 4d theories. In addition, we consider novel > types of surface defects associated with minuscule representations of g" > style="position: relative;" tabindex="0" id="MathJax-Element-7-Frame" > class="MathJax">𝔤.' where biblionumber = 50314; > > Query OK, 0 rows affected, 1 warning (0.00 sec) > > Rows matched: 1 Changed: 0 Warnings: 1 > > > > Because it doesn’t raise an exception, I’d say it’s silent. You have to > check for it. Using mysql, you can send “SHOW WARNINGS” after the insert > query and you can see something like: > > +---------+------+------------------------------------------ > --------------------------------------+ > > | Level | Code | Message > | > > +---------+------+------------------------------------------ > --------------------------------------+ > > | Warning | 1366 | Incorrect string value: '\xF0\x9D\x94\xA4=A...' for > column 'abstract' at row 1 | > > +---------+------+------------------------------------------ > --------------------------------------+ > > 1 row in set (0.01 sec) > > > > With DBI, it’s harder: http://stackoverflow.com/ > questions/7946655/dbi-perl-logging-mysql-warnings. Apparently you can use > $dbh->{mysql_warning_count} to check for warnings, and then send “SHOW > WARNINGS” via the same database handle and that should give you your > warning result. Or you can run MySQL in “traditional” mode which treats > warnings as errors. > > > > So this seems like an edgecase… it’s some wonky data (albeit valid UTF8 > data) in an otherwise English record. > > > > However, according to MySQL (I haven’t verified for myself), Chinese, > Japanese, and Korean sometimes use 4 bytes for characters. So people using > these languages with Koha could be getting truncated data without realising > that’s what is happening or why. > > > > I’ve tried out utf8mb4 and it works great. I think the downside is that it > uses more space for certain types of columns because it needs to allocate > more space for that 4th byte. I don’t know whether we should consider > switching from utf8 to utf8mb4, so that we’re inclusive of CJK and comply > with RFC3629 which states that there’s a 4 byte maximum. > > > > I think it’s worth discussion though. This wouldn’t just affect libraries > in Asia. It would affect any library which stores records with CJK > characters in a MARC 880 “Alternate Graphic Representation” field. > > > > I’m at the end of my Friday work day, so I’m going to leave it there, but > food for thought! > > > > > > > > > > *Aforementioned strange data: > > We introduce a new perspective and a generalization of spectral networks > for 4d N=2" style="position: relative;" tabindex="0" > id="MathJax-Element-1-Frame" class="MathJax">=2 theories of class S" > style="position: relative;" tabindex="0" id="MathJax-Element-2-Frame" > class="MathJax"> associated to Lie algebras g=An" style="position: > relative;" tabindex="0" id="MathJax-Element-3-Frame" class="MathJax">𝔤=An, > Dn" style="position: relative;" tabindex="0" id="MathJax-Element-4-Frame" > class="MathJax">Dn, E6" style="position: relative;" tabindex="0" > id="MathJax-Element-5-Frame" class="MathJax">E6, and E7" > style="position: relative;" tabindex="0" id="MathJax-Element-6-Frame" > class="MathJax">E7. Spectral networks directly compute the BPS spectra > of 2d theories on surface defects coupled to the 4d theories. A Lie > algebraic interpretation of these spectra emerges naturally from our > construction, leading to a new description of 2d-4d wall-crossing > phenomena. Our construction also provides an efficient framework for the > study of BPS spectra of the 4d theories. In addition, we consider novel > types of surface defects associated with minuscule representations of g" > style="position: relative;" tabindex="0" id="MathJax-Element-7-Frame" > class="MathJax">𝔤. > > > > ** Truncated string: > > We introduce a new perspective and a generalization of spectral networks > for 4d N=2" style="position: relative;" tabindex="0" > id="MathJax-Element-1-Frame" class="MathJax">=2 theories of class S" > style="position: relative;" tabindex="0" id="MathJax-Element-2-Frame" > class="MathJax"> associated to Lie algebras g=An" style="position: > relative;" tabindex="0" id="MathJax-Element-3-Frame" class="MathJax">𝔤 > > > > David Cook > > Systems Librarian > > Prosentient Systems > > 72/330 Wattle St > > Ultimo, NSW 2007 > > Australia > > > > Office: 02 9212 0899 > > Direct: 02 8005 0595 > > > > _______________________________________________ > Koha-devel mailing list > Koha-devel@lists.koha-community.org > http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel > website : http://www.koha-community.org/ > git : http://git.koha-community.org/ > bugs : http://bugs.koha-community.org/ >
_______________________________________________ Koha-devel mailing list Koha-devel@lists.koha-community.org http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel website : http://www.koha-community.org/ git : http://git.koha-community.org/ bugs : http://bugs.koha-community.org/