I recently imported nearly 300,000 Unicorn records into Evergreen, many with accented characters (we have a lot of French and bilingual English-French material).
However, 17 records were rejected by direct_ingest.pl. A handful of
these rejects are our fault, but I think most of the others are
hitting some sort of corner case.
What I suspect is happening is that direct_ingest.pl rejects records
that have an accented character between square brackets ("[" and "]")
in a field. For example, a record with the following 260 subfield will
be rejected:
<subfield code=\"b\">[Bibliothèque nationale du Canada],</subfield>
However, if I remove _either_ the square brackets _or_ the combining
grave accent ("̀"), the record is processed successfully. My first
guess was that the square brackets ("[" and "]") would need to be
escaped for JSON, but since all of these tags are inside a string, the
only characters that should need escaping here are double quotes (").
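To illustrate the point about escaping, here is a minimal sketch in Python (not the Perl that direct_ingest.pl uses, and not the actual .bre structure) that encodes the sample subfield as a JSON string and decodes it again. The variable names are mine, purely for illustration. It only shows what a standards-following JSON encoder does with the brackets and the accent; it says nothing about how Evergreen's own parser behaves.

```python
import json

# The sample 260$b subfield, written as plain MARCXML text (no JSON escaping yet).
subfield = '<subfield code="b">[Bibliothèque nationale du Canada],</subfield>'

# ensure_ascii=False keeps the accented character literal instead of
# turning it into a \uXXXX escape.
encoded = json.dumps(subfield, ensure_ascii=False)
print(encoded)

# Decoding should give back exactly the text we started with.
decoded = json.loads(encoded)
print(decoded == subfield)
```

In the encoded string the double quotes around b come out as \" while the square brackets and the accented character are left alone, which matches how the subfield looks in the .bre file.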
I should also note that other records with accented characters (e.g.,
"è") and/or square brackets were processed just fine, but I
think those with accented characters _between_ square brackets, as
above, are all in the reject pile (not 100% sure, but I'll be
verifying this ASAP).
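One rough way to check that hunch would be to scan subfield text for bracketed spans that contain an accented character. The sketch below is only an assumption about how such a check could look: the helper names (has_accent, accented_bracket_spans) are made up for this example, it works on plain strings rather than on real .bre records, and it ignores nested brackets. It decomposes the text first so that precomposed characters ("è") and base letter plus combining accent are treated the same way.

```python
import re
import unicodedata

# Text between "[" and "]" with no nested brackets inside.
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

def has_accent(text):
    """True if text contains a combining mark once decomposed (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return any(unicodedata.combining(ch) for ch in decomposed)

def accented_bracket_spans(text):
    """Return the bracketed spans in text that contain an accented character."""
    return [m.group(0) for m in BRACKETED.finditer(text) if has_accent(m.group(1))]

# Hypothetical usage on a few subfield values.
for value in ["[Bibliothèque nationale du Canada],",
              "Bibliothèque [Canada]",
              "[Ottawa]"]:
    print(value, accented_bracket_spans(value))
```

Running this over the subfields of the rejected and the accepted records should show whether the "accent between brackets" pattern really separates the two piles.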
I'm attaching a single problem record that's already been passed
through marc2bre.pl in case anyone wants to try reproducing the
problem. Field 260, subfield b contains the problem. I can provide
others if necessary.
I'll continue debugging but I would appreciate any insights!
Cheers,
Warren
eg_sample.bre.gz
Description: GNU Zip compressed data
