https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=29697
--- Comment #2 from Jonathan Druart <[email protected]> ---
(In reply to Katrin Fischer from comment #1)
> Hi Joubu, can you give an example for such characters that would be
> stripped? I want to help, but not sure about why it was added.

Short answer? I don't know.
The long answer is a rabbit hole.

The regex is

    $str =~ s/[^\x09\x0A\x0D\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

And it's related to:
https://en.wikipedia.org/wiki/Valid_characters_in_XML

  U+0009, U+000A, U+000D: these are the only C0 controls accepted in XML 1.0;
  U+0020–U+D7FF, U+E000–U+FFFD: this excludes some (not all) non-characters in the BMP (all surrogates, U+FFFE and U+FFFF are forbidden);
  U+10000–U+10FFFF: this includes all code points in supplementary planes, including non-characters.

So, some weird characters/non-characters :)

> Maybe it would be enough to do it on import/saving a record?

Yes, that's what I was actually suggesting with "Either we assume the MARC::XML that is stored is correct, or we need to add more StripNonXmlChars calls."

Also note that Galen wrote, at the time:

commit b549d7e1f1b7d518e16fa48af7360a38e8233fec
Date:   Fri Feb 8 16:35:18 2008 -0600

    added StripNonXmlChars to C4::Charset

"StripNonXmlChars should not necessarily be used, as it may be better to reject a file or record if it contains that kind of encoding error."

We ended up using it almost everywhere, inconsistently.
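
To make "such characters" concrete, here is a minimal sketch of the same character class, translated to Python rather than taken from Koha's Perl code. The function names strip_non_xml_chars and find_non_xml_chars are hypothetical and exist only for this illustration. The second function shows the "reject instead of strip" alternative mentioned in the commit message: it reports the offending code points so a caller could refuse the record.

```python
import re

# Python translation of the Perl character class quoted above (an
# illustration, not Koha code): matches every character that is NOT
# allowed by the XML 1.0 ranges listed in the Wikipedia excerpt.
NON_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_non_xml_chars(text: str) -> str:
    """Silently drop characters outside the XML 1.0 ranges (hypothetical helper)."""
    return NON_XML_CHARS.sub("", text)


def find_non_xml_chars(text: str) -> list[tuple[int, str]]:
    """Return (offset, 'U+XXXX') pairs for each offending character.

    Hypothetical helper: a caller could use a non-empty result to reject
    the record instead of silently altering it.
    """
    return [
        (m.start(), f"U+{ord(m.group()):04X}")
        for m in NON_XML_CHARS.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Title\x00 with\x0b control\ufffe chars\tand a tab"
    print(repr(strip_non_xml_chars(sample)))
    print(find_non_xml_chars(sample))
```

In this sketch, NUL (U+0000), vertical tab (U+000B), lone surrogates, U+FFFE and U+FFFF are removed, while tab, newline, carriage return and everything in the supplementary planes (including non-characters such as U+1FFFE) pass through unchanged.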
