Dima,

1) Primary key lookups could become a bit faster, but no breakthrough is expected. There will be no need to jump from a B+Tree leaf to a data page, but the tree itself will be bigger, because data records take more space than index records. I expect parity here.
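To put rough numbers on the parity argument in point 1, here is a back-of-the-envelope sketch in Python. All sizes are assumptions for illustration (4 KB pages, 16-byte index entries, 200-byte inlined rows, 10M rows, fully packed pages), not Ignite measurements, and `btree_shape` is a hypothetical helper.

```python
import math

def btree_shape(rows, page_size, leaf_entry_size, inner_entry_size):
    """Return (leaf_pages, height) for a fully packed B+Tree."""
    leaf_fanout = page_size // leaf_entry_size
    inner_fanout = page_size // inner_entry_size
    leaf_pages = math.ceil(rows / leaf_fanout)
    height = 1          # the leaf level
    pages = leaf_pages
    while pages > 1:    # add inner levels until a single root page remains
        pages = math.ceil(pages / inner_fanout)
        height += 1
    return leaf_pages, height

ROWS = 10_000_000
PAGE_SIZE = 4096
KEY_ENTRY = 16    # assumed: key plus a link to the data page
ROW_ENTRY = 200   # assumed: key plus the row stored in the leaf

# Heap approach: leaves hold only keys and links, rows live in data pages.
index_only = btree_shape(ROWS, PAGE_SIZE, KEY_ENTRY, KEY_ENTRY)
# Index-organized approach: leaves hold rows, inner pages still hold keys.
with_data = btree_shape(ROWS, PAGE_SIZE, ROW_ENTRY, KEY_ENTRY)

heap_lookup_pages = index_only[1] + 1   # tree descent plus one data page
iot_lookup_pages = with_data[1]         # tree descent only

print(index_only, with_data, heap_lookup_pages, iot_lookup_pages)
```

With these assumed sizes the inlined tree has far more leaf pages and one extra level, which cancels the saved jump to the data page: both lookups visit 4 pages.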
2) We should observe a dramatic improvement for scans (either ScanQuery or SqlQuery) because data will be stored sequentially within blocks. Consider the following case: a table with 10 records that fit into 1 data page. In the current approach (heap) these records could be located anywhere from 1 data block to 10 different data blocks, depending on update timings and free lists. So you end up with up to 10 page lock/unlock cycles and up to 10 page reads, which will drive our LRU policy mad. With the index-organized approach data will be stored in 1 block in the best case (sequential PK, no fragmentation), or in 2-3 blocks in case of page splits or fragmentation. Clearly, this would be a huge win in terms of locks, page reads and IO for scan workloads.

3) DML will be faster in case of sequential primary keys, e.g. a (nearly) monotonic LONG as a transaction identifier. In this case data will be laid out in a perfectly sequential manner within individual blocks, and in most cases an INSERT will lead to 1 data page update and 1 WAL record. Compare that to 6 WAL records with the current approach. On the other hand, random INSERTs (e.g. a UUID key) could become slower due to page splits and fragmentation. Heap-organized storage is preferable in this case.

4) Ideally we should not have index-per-partition, because in that case PK range scans, which are typical for OLAP workloads and JOINs, will be slow. With a single index, however, it would not be that easy to extract or wipe out an evicted partition. This is another trade-off: fast operations on a stable system at the cost of slower intermediate processes.

On Tue, Nov 28, 2017 at 6:27 AM, Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> Vladimir,
>
> I definitely like the overall direction. My comments are below...
>
>
> On Mon, Nov 27, 2017 at 12:46 PM, Vladimir Ozerov <voze...@gridgain.com>
> wrote:
>
> >
> > I propose to adopt this approach in two phases:
> > 1) Optionally add data to leaf pages. This should improve our ScanQuery
> > dramatically
> >
>
> Definitely a good idea. Shouldn't it make the primary lookups faster as
> well?
>
> > 2) Optionally has single primary index instead of per-partition index. This
> > should improve our updates and SQL scans at the cost of harder rebalance
> > and recovery.
> >
>
> Can you explain why it would improve SQL updates and Scan queries?
>
> Also, why would this approach make rebalancing slower? If we keep the index
> sorted by partition, then the rebalancing process should be able to grab
> any partition at any time. Do you agree?
>
> D.
>
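
P.S. A toy model of the scan case from point 2, in Python. It is only a sketch under stated assumptions: heap placement is modelled as a random pick among 10 candidate pages (standing in for update timings and free lists), the index-organized layout is perfectly packed with no page splits, and both function names are hypothetical.

```python
import random

ROWS_PER_PAGE = 10  # assumed page capacity, matching the 10-record example

def heap_pages_touched(keys, candidate_pages, rng):
    """Heap: each record lands on whatever page happened to have free space."""
    placement = {key: rng.choice(candidate_pages) for key in keys}
    return len(set(placement.values()))

def index_organized_pages_touched(keys, rows_per_page=ROWS_PER_PAGE):
    """Index-organized: records are packed into pages in key order."""
    ordered = sorted(keys)
    return len({position // rows_per_page for position in range(len(ordered))})

keys = list(range(10))
rng = random.Random(42)  # fixed seed so the run is repeatable
heap = heap_pages_touched(keys, candidate_pages=list(range(10)), rng=rng)
packed = index_organized_pages_touched(keys)
print(f"heap scan touches {heap} page(s), index-organized scan touches {packed}")
```

The heap count can be anything from 1 to 10 depending on placement, while the packed layout always needs a single page for 10 records. Page splits and fragmentation, which this model leaves out, would push the second number to 2-3.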