Hi Team,
Can anyone shed some light on why Postgres 11 is extremely slow in my case?
I am making a mirror of zh.wikisource.org: I have downloaded 303049 pages
and stored them in a Postgres 11 database.
My Postgres instance is based on the Docker image postgres:11 and runs on my
MacBook Pro (i7, 16 GB RAM).
The database schema is as follows:

pages(id, url, html, downloaded, inserted_at, updated_at)
books(id, name, info, preface, text, html, url, parent_id, inserted_at,
updated_at, info_html, preface_html)
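
For reference, the tables look roughly like this (the column types here are
my approximation, not the exact DDL):

    -- approximate DDL; actual types may differ
    CREATE TABLE pages (
        id          bigserial PRIMARY KEY,
        url         text,
        html        text,        -- raw page HTML from zh.wikisource.org
        downloaded  boolean,
        inserted_at timestamptz,
        updated_at  timestamptz
    );

    CREATE TABLE books (
        id           bigserial PRIMARY KEY,
        name         text,
        info         text,
        preface      text,
        text         text,       -- plain-text version of pages.html
        html         text,
        url          text,
        parent_id    bigint,
        inserted_at  timestamptz,
        updated_at   timestamptz,
        info_html    text,
        preface_html text
    );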
A wikisource web page is downloaded and its HTML text is inserted into the
“html” column of table “pages”.
Later, books.{name, info, preface, text, html, info_html, preface_html} are
extracted from pages.html. The text column of books is a plain-text version
of the content of the html column of table pages.
On average the text column of table books holds 7635 characters (each
character is 3 bytes long because of UTF-8 encoding, so roughly 22 KB per
row), and I want to add full text search to books(text).
I tried both pg_trgm and my own custom token parser for CJK text:
https://github.com/huangjimmy/pg_cjk_parser
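
The trigram attempt was along these lines (the index name is illustrative):

    CREATE EXTENSION IF NOT EXISTS pg_trgm;

    -- trigram GIN index directly on the large text column
    -- (index name is mine, not from the actual setup)
    CREATE INDEX books_text_trgm_idx
        ON books USING GIN (text gin_trgm_ops);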
To my surprise, Postgres 11 is extremely slow when creating a full text
index. I added a column of type tsvector and tried to create a GIN index on
that column. Postgres could not finish creating the index after a long time,
and I had to cancel the execution.
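
Roughly what I ran (column and index names are mine, and 'zh_cfg' is just a
placeholder for the text search configuration provided by the custom parser):

    ALTER TABLE books ADD COLUMN text_tsv tsvector;

    -- populate the tsvector column from the large text column;
    -- 'zh_cfg' stands in for the actual text search configuration
    UPDATE books SET text_tsv = to_tsvector('zh_cfg', text);

    -- this is the step that never finished
    CREATE INDEX books_text_tsv_idx ON books USING GIN (text_tsv);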
I then tried to create a partial full text index covering only 500 rows, and
it took Postgres 2 to 3 minutes to create it. Extrapolating (303049 / 500 ≈
606 batches at 2-3 minutes each, i.e. 20-30 hours), Postgres would need at
least one day to create a full GIN full text search index for all 303049
rows of data. I think this is ridiculously slow.
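
The partial index experiment looked roughly like this (the WHERE predicate
is illustrative; the point is that it covered only the first 500 rows):

    -- partial GIN index over 500 rows; took 2-3 minutes
    CREATE INDEX books_text_tsv_500_idx
        ON books USING GIN (text_tsv)
        WHERE id <= 500;  -- illustrative predicate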
When I created an FTS index on books(name) or books(info), it took just 3
minutes. However, name and info are extremely short compared to books(text).
I switched to Elasticsearch, which turned out to be extremely efficient for
my case: it indexed all 303049 rows in 3 hours.
Jimmy Huang
[email protected]