Vladimir Ozerov created IGNITE-6025:
---------------------------------------
Summary: SQL: improve CREATE INDEX performance
Key: IGNITE-6025
URL: https://issues.apache.org/jira/browse/IGNITE-6025
Project: Ignite
Issue Type: Improvement
Components: sql
Affects Versions: 2.1
Reporter: Vladimir Ozerov
Fix For: 2.2
When bulk data load is performed, it is considered a good practice to bypass
certain facilities of underlying engine to achieve greater throughput. E.g., TX
or MVCC managers can by bypassed, global table lock can be held instead of
fine-grained page/row/field locks, etc..
Another widely used technique is to drop table indexes and re-build them form
scratch when load finished. This is now possible with help of {{CREATE INDEX}}
command which could be executed in runtime. However, experiments with large
data sets shown that {{DROP INDEX}} -> load -> {{CREATE INDEX}} is *much
slower* than simple load to indexed table. The reasons for this are both
inefficient implementation of {{CREATE INDEX}} command, as well as some storage
architectural decisions.
1) Index is created by a single thread; probably multiple threads could speed
it up and the cost of higher CPU usage. But how to split iteration between
several threads?
2) Cache iteration happens through primary index. So we read an index page, but
to read entries we have to navigate to data page. If single data page is
referenced from N places in the index tree, we will read it N times. This leads
to bad cache locality in memory-only case, and to excessive disk IO in case of
persistence. This could be avoided, if we iterate over data pages, and index
all data from a single page at once.
3) Another widely used technique is building BTree in bottom-up approach. That
is, we sort all data rows first, then build leafs, then go one level up, etc..
This approach could give us the best build performance possible, especially if
p.2 is implemented. However it is the most difficult optimization, which will
require implementation of spilling to disk if result set is too large.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)