On Mon, Mar 28, 2016 at 1:21 PM, Peter Geoghegan <p...@heroku.com> wrote:
> On Mon, Mar 28, 2016 at 12:08 AM, Oleg Bartunov <obartu...@gmail.com> wrote:
> > Should we start thinking about ICU ? I compare Postgres with ICU and without
> > and found 27x improvement in btree index creation for russian strings. This
> > includes effect of abbreviated keys and ICU itself. Also, we'll get system
> > independent locale.
>
> I think we should. I want to develop a detailed proposal before
> talking about it more, though, because the idea is controversial.
>
> Did you use the FreeBSD ports patch? Do you have your own patch that
> you could share?

We'll post the patch. As I remember, Teodor made something to get abbreviated
keys to work. I should say that I got the 27x improvement on my MacBook. I will
check on Linux.

> I'm not surprised that ICU is so much faster, especially now that
> UTF-8 is not a second class citizen (it's been possible to build ICU
> to specialize all its routines to handle UTF-8 for years now). As you
> may know, ICU supports partial sort keys, and sort key compression,
> which may have also helped:
>
> http://userguide.icu-project.org/collation/architecture
>
> That page also describes how binary sort keys are versioned, which
> allows them to be stored on disk. It says "A common example is the use
> of keys to build indexes in databases". We'd be crazy to trust Glibc
> strxfrm() to be stable *on disk*, but ICU already cares deeply about
> the things we need to care about, because it's used by other database
> systems like DB2, Firebird, and in some configurations SQLite [1].
>
> Glibc strxfrm() is not great with codepoints from the Cyrillic
> alphabet -- it seems to store 2 bytes per code-point in the primary
> weight level. So ICU might also do better in your test case for that
> reason.

Yes, I see on this page that ICU is ~3 times faster for Russian text:
http://site.icu-project.org/charts/collation-icu4c48-glibc

> [1] https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/icu/README.txt
> --
> Peter Geoghegan
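
For readers unfamiliar with the abbreviated-key idea mentioned above, here is a rough sketch in Python. It is only an illustration and not PostgreSQL's C implementation: the function names and the `prefix_len` parameter are made up for this example, and it uses the C library's strxfrm() through Python's `locale` module rather than ICU. The point is that a short prefix of the transformed sort key settles most comparisons cheaply, and the full locale-aware comparison is needed only when two prefixes are equal.

```python
import locale
from functools import cmp_to_key


def abbreviated_compare(a, b, prefix_len=4):
    """Compare two strings using a sort-key prefix, then a full tie-break.

    prefix_len is a hypothetical knob for this sketch; it is not a
    PostgreSQL or ICU setting.
    """
    # strxfrm() returns a key whose plain comparison follows the
    # collation order of the current LC_COLLATE locale.
    key_a = locale.strxfrm(a)
    key_b = locale.strxfrm(b)

    # Cheap path: if the key prefixes differ, their order already
    # matches the order of the full keys.
    prefix_a = key_a[:prefix_len]
    prefix_b = key_b[:prefix_len]
    if prefix_a != prefix_b:
        return -1 if prefix_a < prefix_b else 1

    # Slow path: equal prefixes decide nothing, so fall back to the
    # authoritative locale-aware comparison.
    return locale.strcoll(a, b)


def sort_abbreviated(words, prefix_len=4):
    """Sort words with the prefix-first comparison above."""
    def compare(a, b):
        return abbreviated_compare(a, b, prefix_len)

    return sorted(words, key=cmp_to_key(compare))
```

Calling `sort_abbreviated(words)` should give the same order as `sorted(words, key=locale.strxfrm)` in whatever locale is active. The sketch recomputes keys on every comparison for brevity; a real implementation would compute each abbreviated key once and store it next to the tuple being sorted.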