https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=42880

--- Comment #1 from Martin Renvoize (ashimema) 
<[email protected]> ---
Created attachment 200647
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=200647&action=edit
Bug 42880: Avoid the USMARC round-trip when building ES documents

For every record, marc_records_to_documents serialized the record to USMARC and
then re-parsed it with MARC::Record->new_from_usmarc, purely to "prove it will
round-trip" before storing marc_data. Profiling shows that this re-decode
accounts for ~20-25% of the per-record mapping CPU.

The round-trip only needs to fire for records that cannot be represented in
ISO 2709. Those cases are cheaply detectable from the record itself:
 - the total length overflows the 5-digit leader length (> 99999),
 - a single field overflows the 4-digit directory length (> 9999), or
 - field data contains raw ISO 2709 separators (0x1d/0x1e/0x1f).
as_usmarc() carps only about the first. We now check all three and only run the
full round-trip decode when one is hit; on the happy path it is skipped.
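
The three checks above can be sketched roughly as follows. This is an
illustration in Python, not the Perl code from the patch: the Field class,
the needs_round_trip name and the exact length arithmetic (24-byte leader,
12-byte directory entries, one terminator byte per field) are assumptions
made for the example, and the boundaries used in Koha may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# ISO 2709 record, field and subfield separators.
SEPARATORS = ("\x1d", "\x1e", "\x1f")
LEADER_LEN = 24            # fixed-size leader
DIRECTORY_ENTRY_LEN = 12   # tag (3) + field length (4) + offset (5)
MAX_RECORD_LEN = 99999     # 5-digit record length in the leader
MAX_FIELD_LEN = 9999       # 4-digit field length in the directory


@dataclass
class Field:
    """Hypothetical stand-in for a MARC field (not a MARC::Field API)."""
    tag: str
    indicators: str = ""
    subfields: List[Tuple[str, str]] = field(default_factory=list)
    data: str = ""  # used by control fields only

    def values(self) -> List[str]:
        # Every user-supplied string that must not contain a raw separator.
        if self.subfields:
            return [self.indicators] + [c + v for c, v in self.subfields]
        return [self.data]

    def encoded_length(self) -> int:
        # Serialized body plus the trailing field terminator (0x1e).
        if self.subfields:
            body = self.indicators + "".join(
                "\x1f" + c + v for c, v in self.subfields
            )
        else:
            body = self.data
        return len(body.encode("utf-8")) + 1


def needs_round_trip(fields: List[Field]) -> bool:
    """Return True when the record may not be representable in ISO 2709."""
    # Leader + directory + directory terminator + record terminator.
    total = LEADER_LEN + DIRECTORY_ENTRY_LEN * len(fields) + 2
    for f in fields:
        if any(sep in v for v in f.values() for sep in SEPARATORS):
            return True  # raw separator inside field data
        length = f.encoded_length()
        if length > MAX_FIELD_LEN:
            return True  # overflows the 4-digit directory length
        total += length
    return total > MAX_RECORD_LEN  # overflows the 5-digit leader length
```

Only when such a predicate returns true would the full serialize-and-decode
check run; every other record would skip it.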

The separator check makes the skip provably safe rather than relying on
MARC::Record's tolerance of stray separators. Those characters are invalid in
XML and are stripped on save by bug 35104, so in practice they never reach
indexing -- this bug depends on 35104 and the check is belt-and-suspenders.

The MARCXML fallback for non-representable records is unchanged.

Test plan:
1. prove t/db_dependent/Koha/SearchEngine/Elasticsearch.t
2. Confirm a normal record is stored as base64ISO2709 and round-trips.
3. Confirm that a record that is too long overall, or that has an over-long
   field, falls back to MARCXML.
