I recently imported nearly 300,000 Unicorn records into Evergreen,
many with accented characters (we have a lot of French and bilingual
English-French material).

However, direct_ingest.pl rejected 17 of the records. A handful of
these rejects are our fault, but I think most of the others are
hitting some sort of corner case.

What I suspect is happening is that direct_ingest.pl rejects records
that have an accented character between square brackets ("[" and "]")
in a field. For example, a record with the following 260 subfield will
be rejected:

  <subfield code=\"b\">[Bibliothe&#x300;que nationale du Canada],</subfield>

However, if I remove _either_ the square brackets _or_ the "&#x300;",
the record will be successfully processed. My first guess was that the
square brackets ("[" and "]") would need to be escaped for JSON, but
since all of these tags are inside a string, the only characters that
should need to be escaped are the double quotes (").
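
As a quick sanity check of that reasoning, here is a minimal sketch in
Python. Python's json module is only a stand-in for whatever JSON encoder
the import scripts actually use (that substitution is an assumption), and
the variable names are made up for illustration. It serializes the
subfield above as a JSON string so the escaping can be inspected:

```python
import json

# The 260$b subfield from the rejected record, with the combining grave
# accent still written as a numeric character reference.
subfield = (
    '<subfield code="b">'
    '[Bibliothe&#x300;que nationale du Canada],'
    '</subfield>'
)

# Serialize the whole tag as a single JSON string value.
encoded = json.dumps(subfield)
print(encoded)

# Decoding should give back exactly what went in.
decoded = json.loads(encoded)
print(decoded == subfield)
```

In the encoded form the two double quotes around the "b" pick up a
backslash, while the square brackets pass through untouched, so at least
with this encoder the brackets alone should not be the problem.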

I should also note that other records with accented characters (e.g.,
"e&#x300;") and/or square brackets were processed just fine, but I
think those with accented characters _between_ square brackets, as
above, are all in the reject pile (not 100% sure, but I'll be
verifying this ASAP).
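
One way to check that hunch might be to scan the subfield text of the
accepted and rejected records for a combining accent inside a bracketed
span. The following is a rough sketch only: the function name and regex
are hypothetical, not part of Evergreen, and it assumes accents appear
either as numeric references in the U+0300 to U+036F range (like
"&#x300;") or as the literal combining characters.

```python
import re

# A "[" ... "]" span with no nested brackets that contains a combining
# diacritical mark, written either as a hex character reference
# (&#x300; through &#x36F;) or as the literal character.
ACCENT_IN_BRACKETS = re.compile(
    r'\[[^\[\]]*'
    r'(?:&#x0*3[0-6][0-9a-fA-F];|[\u0300-\u036f])'
    r'[^\[\]]*\]'
)

def has_accent_in_brackets(text):
    """Return True if text has a combining accent between square brackets."""
    return ACCENT_IN_BRACKETS.search(text) is not None

samples = [
    '[Bibliothe&#x300;que nationale du Canada],',   # accent inside brackets
    'Bibliothe&#x300;que nationale du Canada,',     # accent, no brackets
    '[Ottawa] : Bibliothe&#x300;que nationale,',    # accent outside brackets
]
for s in samples:
    print(has_accent_in_brackets(s), s)
```

If every rejected record matches and no accepted record does, that would
support the theory; a single counterexample either way would rule it out.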

I'm attaching a single problem record that's already been passed
through marc2bre.pl in case anyone wants to try reproducing the
problem. Field 260, subfield b contains the problem. I can provide
others if necessary.

I'll continue debugging but I would appreciate any insights!

Cheers,
  Warren

Attachment: eg_sample.bre.gz
Description: GNU Zip compressed data
