That seems like a smart solution, Nick.

Adam

> On Nov 19, 2021, at 7:28 AM, Robert Newson <b...@rsn.io> wrote:
> 
> Noting that the upgrade channel for views was misconceived (by me) as there 
> is no version number in the header for them. You’d need to add it. 
> 
> B. 
> 
>> On 18 Nov 2021, at 07:12, Nick Vatamaniuc <vatam...@gmail.com> wrote:
>> 
>> Thinking more about this issue I wonder if we can avoid resetting and
>> rebuilding everything from scratch, and instead, let the upgrade
>> happen in the background, while still serving the existing view data.
>> 
>> The realization was that collation doesn't affect the emitted keys and
>> values themselves, only their order in the view b-trees. That means
>> we'd just have to rebuild b-trees, and that is exactly what our view
>> compactor already does.
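
A minimal sketch of the point above, in Python rather than CouchDB's Erlang. The two collators are toy stand-ins (plain code point order versus a hypothetical case-insensitive order), not ICU, and `build_index` / `rebuild` are illustrative names, not CouchDB functions. It only shows that a rebuild keeps the emitted rows and changes nothing but their order:

```python
def collate_v1(key):
    # Toy "old" collator: plain code point order (assumption, not ICU).
    return key

def collate_v2(key):
    # Toy "new" collator: case-insensitive first, then code point order.
    return (key.lower(), key)

def build_index(rows, collate):
    """Return (key, value) rows in the order a b-tree would store them."""
    return sorted(rows, key=lambda row: collate(row[0]))

def rebuild(index, collate):
    """Compaction-style rebuild: reuse the stored rows, only re-sort them."""
    return build_index(list(index), collate)

emitted = [("bob", 1), ("Alice", 2), ("alice", 3), ("Bob", 4)]
old_index = build_index(emitted, collate_v1)
new_index = rebuild(old_index, collate_v2)

# Same rows, different order.
print([key for key, _ in old_index])
print([key for key, _ in new_index])
```

No map function is re-run here: the rebuild works purely from the rows already stored in the old index.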
>> 
>> When we detect a libicu version discrepancy we'd submit the view for
>> compaction. We even have a dedicated "upgrade" [1] channel in smoosh
>> which handles file version format upgrades, but we'll tweak that logic
>> to trigger on libicu version mismatches as well.
>> 
>> Would this work? Does anyone see any issue with that approach?
>> 
>> [1] 
>> https://github.com/apache/couchdb/blob/3.x/src/smoosh/src/smoosh_server.erl#L435-L442
>> 
>> Cheers,
>> -Nick
>> 
>> 
>> 
>>> On Fri, Oct 29, 2021 at 7:01 PM Nick Vatamaniuc <vatam...@apache.org> wrote:
>>> 
>>> Hello everyone,
>>> 
>>> CouchDB by default uses the libicu library to sort its view rows.
>>> When views are built, we do not record or track the version of the
>>> collation algorithm. The issue is that the ICU library may modify the
>>> collation order between major libicu versions, and when that happens,
>>> views built with the older versions may experience data loss. I wanted
>>> to discuss the option to record the libicu collator version in each
>>> view, then warn the user when there is a mismatch, and optionally
>>> ignore the mismatch or automatically rebuild the views.
>>> 
>>> Imagine, for example, searching patient records using start/end keys.
>>> It could be possible that, say, the first letter of their name now
>>> collates differently in a new libicu. That would prevent the patient
>>> record from showing up in the view results for some important
>>> procedure or medication. Users might not even be aware of this kind of
>>> data loss occurring: there won't be any error in the API or warning in
>>> the logs.
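
To make the failure mode concrete, here is a rough Python sketch. The collators are toy stand-ins rather than ICU, and `range_query` is a hypothetical helper that mimics a start/end key lookup by binary search. The index was sorted with the old collator, but the query compares keys with the new one, so the search lands in the wrong place and a matching row is silently skipped:

```python
from bisect import bisect_left, bisect_right

def collate_old(key):
    # Toy "old" collator: plain code point order (assumption, not ICU).
    return key

def collate_new(key):
    # Toy "new" collator: case-insensitive first, then code point order.
    return (key.lower(), key)

def range_query(index, collate, startkey, endkey):
    """Binary-search the stored rows, comparing keys with `collate`."""
    sort_keys = [collate(key) for key, _ in index]
    lo = bisect_left(sort_keys, collate(startkey))
    hi = bisect_right(sort_keys, collate(endkey))
    return index[lo:hi]

rows = [("adams", "patient-1"), ("baker", "patient-2"), ("Zed", "patient-3")]

# Index built before the library upgrade, never rebuilt.
stale_index = sorted(rows, key=lambda row: collate_old(row[0]))
# Index rebuilt with the new collator.
fresh_index = sorted(rows, key=lambda row: collate_new(row[0]))

stale_result = range_query(stale_index, collate_new, "x", "zz")
fresh_result = range_query(fresh_index, collate_new, "x", "zz")
print(stale_result)  # the "Zed" row is missing, with no error raised
print(fresh_result)
```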
>>> 
>>> I was thinking about how to solve this. There were a few commits already
>>> to clean up our collation drivers [1], expose libicu and collation
>>> algorithm version in the new _versions endpoint [2], and some other
>>> minor fixes in that area. As the next steps we could:
>>> 
>>> 1) Modify our views to keep track of the collation algorithm
>>> version. We could attempt to transparently upgrade the view header
>>> format -- read the old view file, update the header with an extra
>>> libicu collation version field, that updates the signature, and then,
>>> save the file with the new header and new signature. This avoids view
>>> rebuilds, just records the collator version in the view and moves the
>>> files to a new name.
>>> 
>>> 2) Do what PostgreSQL does, and 2a) emit a warning with the view
>>> results when the current libicu version doesn't match the version in
>>> the view [3]. That means altering the view results to add a "warning":
>>> "..." field. Another alternative, 2b), is to emit a warning in
>>> _design/$ddoc/_info only. Users would have to know to check
>>> _design/$ddoc/_info for each ddoc in each db after an OS version
>>> upgrade or after restoring backups. Of course, there may be users
>>> who use the "raw" collation option, or who know they are using only
>>> plain ASCII character sets in their views. So we'd have a
>>> configuration setting to ignore the warnings as well.
>>> 
>>> 3) Users who see the warning could then either rebuild the view
>>> with the new collator library manually, or it could happen
>>> automatically based on a configuration option, basically "when
>>> collator versions are mismatched, invalidate and rebuild all the
>>> views".
>>> 
>>> 4) We'd have a way for the users to assert (POST a ddoc update) that
>>> they double-checked the new ICU version and are convinced that a
>>> particular view would not experience data loss with the new collator.
>>> That should make the warning go away and prevent a view rebuild.
>>> This can't be just a naive "collator" option setting as both per-view
>>> and per-design options are used when computing the view signature, and
>>> any changes there would result in the view being rebuilt. Perhaps we
>>> can add it to the design docs as a separate option which is excluded
>>> from the signature hash, like the "autoupdate" setting for background
>>> index builder ("collation_version_accept"?). PostgreSQL also offers
>>> this option with the ALTER COLLATION ... REFRESH VERSION command [3].
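
The idea in 4) of an option that is left out of the signature hash could look roughly like the Python sketch below. This is a simplified stand-in and not how CouchDB actually computes view signatures: the JSON-plus-SHA-256 hashing, the `view_signature` helper, and the placement of the tentative "collation_version_accept" option under "options" are all assumptions made for illustration.

```python
import hashlib
import json

# Options that must not influence the signature (hypothetical list).
EXCLUDED_FROM_SIGNATURE = {"autoupdate", "collation_version_accept"}

def view_signature(ddoc):
    """Hash the view definitions and only the signature-relevant options."""
    options = {
        name: value
        for name, value in ddoc.get("options", {}).items()
        if name not in EXCLUDED_FROM_SIGNATURE
    }
    payload = json.dumps(
        {"views": ddoc.get("views", {}), "options": options},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

base = {
    "views": {"by_name": {"map": "function(doc) { emit(doc.name, null); }"}},
    "options": {},
}
# The user asserts that the new collator version is acceptable.
accepted = {**base, "options": {"collation_version_accept": "153.88"}}
# A real collation change, which should still force a rebuild.
raw = {**base, "options": {"collation": "raw"}}

print(view_signature(base) == view_signature(accepted))
print(view_signature(base) == view_signature(raw))
```

The version string "153.88" is a made-up placeholder. The point is only that accepting a collator version leaves the signature, and therefore the index file name, unchanged, while a real option change still alters it.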
>>> 
>>> What do we think, is this a reasonable approach? Is there something
>>> easier / simpler we can do?
>>> 
>>> Thanks!
>>> -Nick
>>> 
>>> [1] 
>>> https://github.com/apache/couchdb/pull/3746/commits/28f26f52fe2e170d98658311dafa8198d96b8061
>>> [2] 
>>> https://github.com/apache/couchdb/commit/c1bb4e4856edd93255d75ebe158b4da38bbf3333
>>> [3] https://www.postgresql.org/docs/13/sql-altercollation.html
> 
