[
https://issues.apache.org/jira/browse/IGNITE-6025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vladimir Ozerov resolved IGNITE-6025.
-------------------------------------
Resolution: Duplicate
Separate issues were created for each idea:
1) IGNITE-6406
2) IGNITE-6057
3) IGNITE-6407
> SQL: improve CREATE INDEX performance
> -------------------------------------
>
> Key: IGNITE-6025
> URL: https://issues.apache.org/jira/browse/IGNITE-6025
> Project: Ignite
> Issue Type: Task
> Components: persistence, sql
> Affects Versions: 2.1
> Reporter: Vladimir Ozerov
> Labels: performance
>
> When bulk data load is performed, it is considered a good practice to bypass
> certain facilities of underlying engine to achieve greater throughput. E.g.,
> TX or MVCC managers can by bypassed, global table lock can be held instead of
> fine-grained page/row/field locks, etc..
> Another widely used technique is to drop table indexes and re-build them form
> scratch when load finished. This is now possible with help of {{CREATE
> INDEX}} command which could be executed in runtime. However, experiments with
> large data sets shown that {{DROP INDEX}} -> load -> {{CREATE INDEX}} is
> *much slower* than simple load to indexed table. The reasons for this are
> both inefficient implementation of {{CREATE INDEX}} command, as well as some
> storage architectural decisions.
> 1) Index is created by a single thread; probably multiple threads could speed
> it up and the cost of higher CPU usage. But how to split iteration between
> several threads?
> 2) Cache iteration happens through primary index. So we read an index page,
> but to read entries we have to navigate to data page. If single data page is
> referenced from N places in the index tree, we will read it N times. This
> leads to bad cache locality in memory-only case, and to excessive disk IO in
> case of persistence. This could be avoided, if we iterate over data pages,
> and index all data from a single page at once.
> 3) Another widely used technique is building BTree in bottom-up approach.
> That is, we sort all data rows first, then build leafs, then go one level up,
> etc.. This approach could give us the best build performance possible,
> especially if p.2 is implemented. However it is the most difficult
> optimization, which will require implementation of spilling to disk if result
> set is too large.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)