On Mon, Mar 28, 2016 at 1:21 PM, Peter Geoghegan <p...@heroku.com> wrote:

> On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartu...@gmail.com>
> wrote:
> > Should we start thinking about ICU? I compared Postgres with and
> > without ICU and found a 27x improvement in btree index creation for
> > Russian strings. This includes the effect of abbreviated keys and of
> > ICU itself. Also, we'll get a system-independent locale.
> I think we should. I want to develop a detailed proposal before
> talking about it more, though, because the idea is controversial.
> Did you use the FreeBSD ports patch? Do you have your own patch that
> you could share?

We'll post the patch. As I remember, Teodor did something to get
abbreviated keys working. I should say that I got the 27x improvement on
my MacBook; I will check on Linux.

> I'm not surprised that ICU is so much faster, especially now that
> UTF-8 is not a second class citizen (it's been possible to build ICU
> to specialize all its routines to handle UTF-8 for years now). As you
> may know, ICU supports partial sort keys, and sort key compression,
> which may have also helped:
> http://userguide.icu-project.org/collation/architecture

> That page also describes how binary sort keys are versioned, which
> allows them to be stored on disk. It says "A common example is the use
> of keys to build indexes in databases". We'd be crazy to trust Glibc
> strxfrm() to be stable *on disk*, but ICU already cares deeply about
> the things we need to care about, because it's used by other database
> systems like DB2, Firebird, and in some configurations SQLite [1].
> Glibc strxfrm() is not great with code points from the Cyrillic
> alphabet -- it seems to store 2 bytes per code point in the primary
> weight level. So ICU might also do better in your test case for that
> reason.

Yes, I see on this page that ICU is ~3 times faster for Russian text.
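
To illustrate the abbreviated-key idea discussed above, here is a minimal sketch in Python. It is not the PostgreSQL or ICU implementation: full_key() is a hypothetical stand-in that uses raw UTF-8 bytes where a real system would call a collation routine (ICU sort keys or strxfrm()), and the 8-byte prefix length is an assumption chosen for illustration.

```python
from functools import cmp_to_key

ABBREV_LEN = 8  # assumed prefix size in bytes; illustrative only


def full_key(s: str) -> bytes:
    # Hypothetical stand-in for a real collation sort key.
    # A real system would produce this with ICU or strxfrm().
    return s.encode("utf-8")


def abbrev_key(s: str) -> bytes:
    # Abbreviated key: a fixed-size prefix of the full sort key.
    return full_key(s)[:ABBREV_LEN]


def compare(a: str, b: str) -> int:
    # Cheap path: compare the short prefixes first.
    ka, kb = abbrev_key(a), abbrev_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    # Prefixes tie, so fall back to the full (expensive) comparison.
    fa, fb = full_key(a), full_key(b)
    return (fa > fb) - (fa < fb)


def abbreviated_sort(strings):
    return sorted(strings, key=cmp_to_key(compare))
```

In this stand-in each Cyrillic letter takes 2 bytes, so an 8-byte prefix distinguishes only the first four letters. The more compact the key, the more comparisons the prefix alone can resolve before the fallback is needed.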

> [1]
> https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
> --
> Peter Geoghegan
