Hello everyone,

CouchDB by default uses the libicu library to sort its view rows.
When views are built, we do not record or track the version of the
collation algorithm. The issue is that the ICU library may modify the
collation order between major libicu versions, and when that happens,
views built with the older versions may experience data loss. I wanted
to discuss the option of recording the libicu collator version in each
view and warning the user when there is a mismatch, with options to
ignore the mismatch or to automatically rebuild the views.

Imagine, for example, searching patient records using start/end keys.
It could be possible that, say, the first letter of their name now
collates differently in a new libicu. That would prevent the patient
record from showing up in the view results for some important
procedure or medication. Users might not even be aware of this kind of
data loss occurring, since there won't be any error in the API or
warning in the logs.
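
To make the failure mode concrete, here is a toy sketch in Python (not
CouchDB code). The two key functions are made-up stand-ins for an "old"
and a "new" collator; real ICU changes are far more subtle. The index is
built with one ordering and then range-searched with the other, which is
roughly what happens when a view file outlives a libicu upgrade.

```python
from bisect import bisect_left, bisect_right

def old_key(s):
    # Hypothetical "old collator": case-insensitive ordering.
    return s.lower()

def new_key(s):
    # Hypothetical "new collator": raw code point ordering.
    return s

def range_query(rows, startkey, endkey, key):
    # Binary search over rows that are ASSUMED to be sorted by `key`.
    keys = [key(r) for r in rows]
    lo = bisect_left(keys, key(startkey))
    hi = bisect_right(keys, key(endkey))
    return rows[lo:hi]

# The "view" is built once, using the old ordering.
index = sorted(["davis", "Clark", "baker", "Adams"], key=old_key)

# Same query, same stored rows, different collator at query time.
with_old = range_query(index, "b", "bz", old_key)
with_new = range_query(index, "b", "bz", new_key)
print(with_old, with_new)
```

With the old ordering the query finds "baker"; with the new ordering the
search walks a list that is no longer sorted from its point of view and
returns nothing, without raising any error.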

I was thinking about how to solve this. There were a few commits
already to clean up our collation drivers [1], expose the libicu and
collation algorithm versions in the new _versions endpoint [2], and
some other minor fixes in that area. As the next steps we could:

  1) Modify our views to keep track of the collation algorithm
version. We could attempt to transparently upgrade the view header
format -- read the old view file, add a libicu collation version field
to the header (which changes the signature), and then save the file
with the new header and new signature. This avoids view rebuilds: it
just records the collator version in the view and moves the files to a
new name.

  2) Do what PostgreSQL does, and 2a) emit a warning with the view
results when the current libicu version doesn't match the version in
the view [3]. That means altering the view results to add a "warning":
"..." field. Another alternative, 2b), is to emit a warning in
_design/$ddoc/_info only. Users would then have to know to check
_design/$ddoc/_info for each ddoc in each db after an OS version
upgrade or after restoring backups. Of course, there may be users who
use the "raw" collation option, or who know they are using just the
plain ASCII character set in their views. So we'd have a configuration
setting to ignore the warnings as well.

  3) Users who see the warning could then either rebuild the view
manually with the new collator library, or it could happen
automatically based on a configuration option, basically "when
collator versions are mismatched, invalidate and rebuild all the
views".

  4) We'd have a way for the users to assert (POST a ddoc update) that
they double-checked the new ICU version and are convinced that a
particular view would not experience data loss with the new collator.
That should make the warning go away and keep the view from being rebuilt.
This can't be just a naive "collator" option setting as both per-view
and per-design options are used when computing the view signature, and
any changes there would result in the view being rebuilt. Perhaps we
can add it to the design docs as a separate option which is excluded
from the signature hash, like the "autoupdate" setting for background
index builder ("collation_version_accept"?). PostgreSQL also offers
this option with the ALTER COLLATION ... REFRESH VERSION command [3].
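
As a rough illustration of how steps 2) to 4) could fit together, here
is a small Python sketch. It is not CouchDB's implementation (which is
in Erlang and computes signatures differently); the function names, the
policy values "warn" / "ignore" / "rebuild", and the version strings are
all assumptions made for the example. The only name taken from the
proposal is the tentative "collation_version_accept" option.

```python
import hashlib
import json

# Options left out of the signature, so changing them never forces a
# rebuild. "collation_version_accept" is the tentative name from above.
UNSIGNED_OPTIONS = {"collation_version_accept"}

def view_signature(views, options):
    # Illustrative only: hash the view definitions plus the signed options.
    signed = {k: v for k, v in options.items() if k not in UNSIGNED_OPTIONS}
    payload = json.dumps({"views": views, "options": signed}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def collation_action(recorded, current, accepted=None, on_mismatch="warn"):
    # recorded: collator version stored in the view header (step 1)
    # current:  collator version of the libicu loaded right now
    # accepted: version the user asserted is safe for this view (step 4)
    if on_mismatch not in ("warn", "ignore", "rebuild"):
        raise ValueError("unknown mismatch policy: %r" % (on_mismatch,))
    if recorded == current or accepted == current:
        return "ok"
    # Mismatch: follow the configured policy (steps 2 and 3).
    return "ok" if on_mismatch == "ignore" else on_mismatch

views = {"by_name": {"map": "function(doc) { emit(doc.name, null); }"}}
sig_plain = view_signature(views, {})
sig_accepted = view_signature(views, {"collation_version_accept": "153.112"})
sig_raw = view_signature(views, {"collation": "raw"})
```

In this sketch, adding the acceptance option leaves the signature
unchanged, so the existing view file stays in use, while changing a
signed option such as "collation" still yields a new signature.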

What do we think, is this a reasonable approach? Is there something
easier / simpler we can do?

Thanks!
-Nick

[1] 
https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
[2] 
https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
[3] https://www.postgresql.org/docs/13/sql-altercollation.html
