On 1/31/17 16:59, Thomas Munro wrote:
> I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
> the start of the string via the UCharIterator passed in, unless you
> have the rare reverse-accent-sort feature enabled (maybe used only in
> fr_CA, it looks like it is required to scan the whole string looking
> for the last accent).  So I assume that uiter_setUTF8 and
> ucol_nextSortKeyPart would usually do a small fixed amount of work,
> whereas this patch's icu_to_uchar allocates space and converts the
> whole variable length string every time.

I have changed it to use ucol_nextSortKeyPart() when appropriate.

> That's about abbreviation, but I note that you can also compare
> strings using iterators with ucol_strcollIter, avoiding the need to
> allocate and transcode up front.  I have no idea whether that'd pay
> off.

These iterators don't actually transcode on the fly.  You can only set
up iterators for UTF-8 or UTF-16 strings.  And for UTF-8, we already
have a special code path using ucol_strcollUTF8(), which uses an
interator internally.

> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
> and the ICU major version changes we expect to have to REINDEX, but
> does anyone know if such data changes are ever pulled into the minor
> version package upgrades you get from regular apt-get update of (say)
> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
> expect collversion changes to happen basically any time in response to
> regular system updates, or only when you're doing a major upgrade of
> some kind, as the above-quoted documentation patch implies?

Stable operating systems shouldn't do major library upgrades, so I feel
pretty confident about this.

> +               collversion = ntohl(*((uint32 *) versioninfo));
> UVersionInfo is an array of four uint8_t.  That doesn't sound like
> something that needs to be bit-swizzled... is it really?  Even if it
> is arranged differently on different architectures, I'm not sure why
> you care since we only ever use it to compare for equality on the same
> system.  But aside from that, I don't love this cast to uint32.  I
> wonder if we should use u_versionToString to respect boundaries a bit
> better?

Makes sense, changed the column type to text.

> I have another motivation for wanting to model collation versions as
> strings: I have been contemplating a version check for system-provided
> collations too, piggy-backing on your work here.  Obviously PostgreSQL
> can't directly know anything about system collation versions, but I'm
> thinking of a GUC system_collation_version_command which you could
> optionally set to a script that knows how to inspect the local
> operating system.  For example, a package maintainer might be
> interested in writing such a script that knows how to ask the package
> system for a locale data version.  A brute force approach that works
> on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.

That sounds like an idea worth pursuing.

Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to