Moving this to the incubator dev list (just looking it over again).

Q: when a merge finishes and there's a bigger backlog of components, will it currently consider doing a more-ways merge? (Instead of 5, if there are 13 sitting there when the 5-merge finishes, will a 13-merge be initiated?) Just curious.

We probably do need to think about some sort of better flow control here - the storage gate should presumably slow down admissions if it can't keep up - I have to ponder what that might mean. (I have a better idea of what it could mean for feeds than for normal inserts.) One could argue that an increasing backlog is a sign that we should be scaling out the number of partitions for the dataset (future work, but important work :-)).

Cheers,
Mike

On 4/15/15 2:33 PM, [email protected] wrote:
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 868 by [email protected]: About prefix merge policy behavior
https://code.google.com/p/asterixdb/issues/detail?id=868

I describe how the current prefix merge policy works, based on observations from ingestion experiments.
A similar behavior was also observed by Sattam.
The observed behavior seems a bit unexpected, so I'm posting the observations here so that we can consider a better merge policy and/or a better LSM index design with regard to merge operations.

The AQL statements used for the experiment are shown at the end of this writing.

The prefix merge policy decides to merge disk components based on the following conditions:

1. Look at the candidate components for merging in oldest-first order. If candidates exist, identify the prefix of the sequence of all such components for which the sum of their sizes exceeds MaxMergableComponentSize. Schedule a merge of those components into a new component.

2. If a merge from 1 doesn't happen, see if the number of candidate components for merging exceeds MaxToleranceComponentCnt. If so, schedule a merge of all of the current candidates into a new single component.

Also, the prefix merge policy doesn't allow concurrent merge operations for a single index partition. In other words, if there is a scheduled or ongoing merge operation, no new merge operation is scheduled even if the above conditions are met.
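For concreteness, the two rules above can be sketched roughly as follows. This is a hypothetical Python sketch of the decision logic as described, not AsterixDB's actual code; the constant and function names are made up for illustration.

```python
# Illustrative constants; MaxMergableComponentSize is 1 GB here, matching
# the experiment described below. Names are invented for this sketch.
MAX_MERGABLE_COMPONENT_SIZE = 1 << 30
MAX_TOLERANCE_COMPONENT_CNT = 5

def components_to_merge(candidate_sizes):
    """candidate_sizes: sizes of candidate disk components, oldest first.
    Returns the sizes of the components to merge, or [] if no merge."""
    # Rule 1: oldest-first, find the shortest prefix whose total size
    # exceeds MaxMergableComponentSize; if found, merge that prefix.
    total = 0
    for i, size in enumerate(candidate_sizes):
        total += size
        if total > MAX_MERGABLE_COMPONENT_SIZE:
            return candidate_sizes[:i + 1]
    # Rule 2: otherwise, if there are too many candidates, merge them all.
    # (>= is used here to match the scenario below, where the 5th flush
    # triggers a merge with MaxToleranceCompCnt = 5.)
    if len(candidate_sizes) >= MAX_TOLERANCE_COMPONENT_CNT:
        return list(candidate_sizes)
    return []
```

Note that this sketch does not model the single-merge-at-a-time restriction; it only shows which components a newly allowed merge would pick.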

Based on this merge policy, the following situation can occur.
Suppose MaxToleranceCompCnt = 5 and 5 disk components were flushed to disk. When the 5th disk component is flushed, the prefix merge policy schedules a merge operation for the 5 components. While the merge operation is scheduled and running, concurrently ingested records generate more disk components. As long as a merge operation is not fast enough to keep up with the rate at which ingestion generates disk components, the number of disk components increases over time.
So, the slower merge operations are, the more disk components there will be as time goes on.
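The arithmetic of this runaway can be shown with a toy simulation. The numbers here are assumptions for illustration, not measurements from the experiment: one flush per time step, one merge at a time, and each merge consuming the oldest 5 components while taking 7 steps.

```python
def simulate(steps, merge_time=7, tolerance=5):
    """Toy model of the scenario above: one flush per step; a merge of
    `tolerance` components takes `merge_time` steps; only one merge
    runs at a time. Returns the component count at each step."""
    disk_components = 0
    merge_end = None            # completion time of the in-flight merge
    history = []
    for t in range(steps):
        if merge_end is not None and t == merge_end:
            disk_components -= tolerance - 1  # 5 components become 1
            merge_end = None
        disk_components += 1                  # a flush adds a component
        if merge_end is None and disk_components >= tolerance:
            merge_end = t + merge_time        # merge the oldest `tolerance`
        history.append(disk_components)
    return history
```

With these assumed numbers, each 7-step merge cycle retires 4 components (5 become 1) but admits 7 new ones, so the backlog grows by roughly 3 components per cycle, which is the unbounded growth described above.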

I also attached the result of the command "ls -alR <directory of the asterixdb instance for an ingestion experiment>", which was executed after the ingestion was over. The attached file shows that for the primary index (whose directory is FsqCheckinTweet_idx_FsqCheckinTweet), ingestion generated 20 disk components, where each disk component consists of a btree (filename suffix _b) and a bloom filter (filename suffix _f), and MaxMergableComponentSize is set to 1GB. It also shows that for the secondary index (whose directory is FsqCheckinTweet_idx_sifCheckinCoordinate), ingestion generated more than 1400 disk components, where each disk component consists of a dictionary btree (suffix _b), an inverted list (suffix _i), a deleted-key btree (suffix _d), and a bloom filter for the deleted-key btree (suffix _f). Even after the ingestion is over, since our merge operations happen asynchronously, merging continues and eventually merges all mergable disk components according to the described merge policy.

------------------------------------------
AQLs for the ingestion experiment
------------------------------------------
drop dataverse STBench if exists;
create dataverse STBench;
use dataverse STBench;

create type FsqCheckinTweetType as closed {
    id: int64,
    user_id: int64,
    user_followers_count: int64,
    text: string,
    datetime: datetime,
    coordinates: point,
    url: string?
};
create dataset FsqCheckinTweet (FsqCheckinTweetType) primary key id;

/* this index type is only available in the kisskys/hilbertbtree branch;
   however, you can easily replace the sif index with an inverted keyword
   index on the text field and you will see similar behavior */
create index sifCoordinate on FsqCheckinTweet(coordinates) type sif(-180.0, -90.0, 180.0, 90.0);

/* create feed */
create feed  TweetFeed
using file_feed
(("fs"="localfs"),
("path"="127.0.0.1:////Users/kisskys/Data/SynFsqCheckinTweet.adm"),("format"="adm"),("type-name"="FsqCheckinTweetType"),("tuple-interval"="0"));

/* connect feed */
use dataverse STBench;
set wait-for-completion-feed "true";
connect feed TweetFeed to dataset FsqCheckinTweet;




Attachments:
    storage-layout.txt  574 KB
