[ 
https://issues.apache.org/jira/browse/ASTERIXDB-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Luo updated ASTERIXDB-2541:
--------------------------------
    Description: 
Our currently AsynchronousScheduler tries to schedule all merge operations at 
the same without any control. This is not optimal in terms of minimizing the 
number of disk components, which directly impacts query performance.

Here we introduce GreedyScheduler to minimize the number of disk components 
over time. It keeps tracks of all merge operations of an LSM index, and only 
activates the merge operation with the smallest number of remaining I/Os. It 
can be proven that if the number of components is the same for all merge 
operations, then this GreedyScheduler is strictly optimal. Otherwise, this will 
still be a good heuristic.

In order for GreedyScheduler to work, we need the following two changes:
* Keep track of the number of scanned pages of index cursors so that we will 
know how many pages left;
* Introduce a mechanism to activate/deactivate merge operations

NOTE: GreedyScheduler should only be used during runtime (with a controlled 
data arrival process) so that it can reduce the number of disk components at 
its best effort. It CANNOT be used when benchmarking the system by writing as 
fast as possible since large merges will be starved. The measured write 
throughput will be high but unsustainable.

  was:
Our currently AsynchronousScheduler tries to schedule all merge operations at 
the same without any control. This is not optimal in terms of minimizing the 
number of disk components, which directly impacts query performance.

Here we introduce GreedyScheduler to minimize the number of disk components 
over time. It keeps tracks of all merge operations of an LSM index, and only 
activates the merge operation with the smallest number of remaining I/Os. It 
can be proven that if the number of components is the same for all merge 
operations, then this GreedyScheduler is strictly optimal. Otherwise, this will 
still be a good heuristic.

In order for GreedyScheduler to work, we need the following two changes:
* Keep track of the number of scanned pages of index cursors so that we will 
know how many pages left;
* Introduce a mechanism to activate/deactivate merge operations


> Introduce GreedyScheduler
> -------------------------
>
>                 Key: ASTERIXDB-2541
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2541
>             Project: Apache AsterixDB
>          Issue Type: Improvement
>          Components: STO - Storage
>            Reporter: Chen Luo
>            Assignee: Chen Luo
>            Priority: Major
>
> Our currently AsynchronousScheduler tries to schedule all merge operations at 
> the same without any control. This is not optimal in terms of minimizing the 
> number of disk components, which directly impacts query performance.
> Here we introduce GreedyScheduler to minimize the number of disk components 
> over time. It keeps tracks of all merge operations of an LSM index, and only 
> activates the merge operation with the smallest number of remaining I/Os. It 
> can be proven that if the number of components is the same for all merge 
> operations, then this GreedyScheduler is strictly optimal. Otherwise, this 
> will still be a good heuristic.
> In order for GreedyScheduler to work, we need the following two changes:
> * Keep track of the number of scanned pages of index cursors so that we will 
> know how many pages left;
> * Introduce a mechanism to activate/deactivate merge operations
> NOTE: GreedyScheduler should only be used during runtime (with a controlled 
> data arrival process) so that it can reduce the number of disk components at 
> its best effort. It CANNOT be used when benchmarking the system by writing as 
> fast as possible since large merges will be starved. The measured write 
> throughput will be high but unsustainable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to