Moving this to the incubator dev list (just looking it over again). Q:
when a merge finishes and there's a bigger backlog of components, will
it currently consider doing a wider merge? (Instead of 5, if there are
13 components sitting there when the 5-way merge finishes, will a
13-way merge be initiated?) Just curious. We probably do need to think
about some sort of better flow control here - the storage gate should
presumably slow down admissions if it can't keep up - I have to ponder
what that might mean. (I have a better idea of what it could mean for
feeds than for normal inserts.) One could argue that an increasing
backlog is a sign that we should be scaling out the number of
partitions for the dataset (future work, but important work :-)).
Cheers,
Mike
On 4/15/15 2:33 PM, [email protected] wrote:
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium
New issue 868 by [email protected]: About prefix merge policy behavior
https://code.google.com/p/asterixdb/issues/detail?id=868
I describe how the current prefix merge policy works based on
observations from ingestion experiments.
Similar behavior was observed by Sattam as well.
The observed behavior seems a bit unexpected, so I'm posting the
observation here so we can consider a better merge policy and/or a
better LSM index design with respect to merge operations.
The AQL statements used for the experiment are shown at the end of
this write-up.
The prefix merge policy decides to merge disk components based on the
following conditions:
1. Look at the candidate components for merging in oldest-first
order. If any exist, identify the prefix of that sequence for which
the sum of the component sizes exceeds MaxMergableComponentSize, and
schedule a merge of those components into a new component.
2. If no merge happens under 1, check whether the number of candidate
components for merging exceeds MaxToleranceComponentCnt. If so,
schedule a merge of all of the current candidates into a single new
component.
Also, the prefix merge policy doesn't allow concurrent merge
operations on a single index partition.
In other words, if there is already a scheduled or ongoing merge
operation, a new merge is not scheduled even when the above conditions
are met.
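To make the decision procedure concrete, here is a minimal sketch in
Java (AsterixDB's implementation language). This is not the actual
PrefixMergePolicy source; the names and structure are hypothetical,
and condition 2 is written as ">=" rather than strictly ">" so that it
matches the 5-component scenario described below.

// Hypothetical, simplified sketch of the decision procedure described
// above -- not the actual AsterixDB PrefixMergePolicy code.
import java.util.List;

class PrefixMergePolicySketch {
    long maxMergableComponentSize; // e.g., 1GB, as in the experiment
    int maxToleranceComponentCnt;  // e.g., 5

    // componentSizes holds the sizes of the candidate disk components,
    // ordered oldest-first. Returns how many of the oldest components
    // to merge into a new component, or 0 if no merge is scheduled.
    int componentsToMerge(List<Long> componentSizes, boolean mergeOngoing) {
        // No concurrent merges on a single index partition: if a merge
        // is already scheduled or ongoing, do nothing.
        if (mergeOngoing) {
            return 0;
        }
        // Condition 1: find the shortest oldest-first prefix whose
        // total size exceeds MaxMergableComponentSize.
        long sum = 0;
        for (int i = 0; i < componentSizes.size(); i++) {
            sum += componentSizes.get(i);
            if (sum > maxMergableComponentSize) {
                return i + 1;
            }
        }
        // Condition 2: too many small candidates; reaching the
        // tolerance count triggers a merge of all of them.
        if (componentSizes.size() >= maxToleranceComponentCnt) {
            return componentSizes.size();
        }
        return 0;
    }
}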
Based on this merge policy, the following situation can occur.
Suppose MaxToleranceComponentCnt = 5 and 5 disk components have been
flushed to disk.
When the 5th disk component is flushed, the prefix merge policy
schedules a merge operation to merge the 5 components.
While that merge operation is scheduled and running, concurrently
ingested records generate more disk components.
As long as a merge operation is not fast enough to keep up with the
rate at which incoming ingested records generate 5 disk components,
the number of disk components increases over time.
So, the slower the merge operations are, the more disk components
there will be as time goes on.
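To illustrate the growth, here is a tiny simulation sketch in Java.
All of the numbers (one flush every 10 seconds, an 80-second 5-way
merge) are made up for illustration and are not measurements from the
experiment; they just satisfy the "merge slower than 5 flushes"
condition above.

// Hypothetical back-of-the-envelope simulation of the backlog growth.
// Assumed numbers: one flush every 10s, one 5-way merge taking 80s,
// and at most one merge running at a time.
class BacklogSketch {
    public static void main(String[] args) {
        final int flushSec = 10;  // a new disk component every 10s
        final int mergeSec = 80;  // a 5-way merge takes 80s
        final int tolerance = 5;  // MaxToleranceComponentCnt
        int components = 0;       // current disk component count
        int mergeRemaining = 0;   // seconds left in the ongoing merge
        for (int t = flushSec; t <= 600; t += flushSec) {
            components++; // a flush completes and adds one component
            if (mergeRemaining > 0) {
                mergeRemaining -= flushSec;
                if (mergeRemaining <= 0) {
                    components -= tolerance - 1; // 5 merged into 1
                }
            }
            if (mergeRemaining <= 0 && components >= tolerance) {
                mergeRemaining = mergeSec; // schedule the next merge
            }
            System.out.printf("t=%3ds components=%d%n", t, components);
        }
    }
}

With these made-up numbers, each 80-second merge cycle retires 4
components while 8 new ones are flushed, so the printed count grows by
roughly 4 per cycle - exactly the "slower merges, more components"
effect described above.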
I also attached the output of the command "ls -alR <directory of the
asterixdb instance for an ingestion experiment>", which was executed
after the ingestion was over.
The attached file shows that for the primary index (whose directory is
FsqCheckinTweet_idx_FsqCheckinTweet), ingestion generated 20 disk
components, where each disk component consists of a btree (the
filename has suffix _b) and a bloom filter (the filename has suffix
_f), and MaxMergableComponentSize is set to 1GB.
It also shows that for the secondary index (whose directory is
FsqCheckinTweet_idx_sifCheckinCoordinate), ingestion generated more
than 1400 components, where each disk component consists of a
dictionary btree (suffix: _b), an inverted list (suffix: _i), a
deleted-key btree (suffix: _d), and a bloom filter for the deleted-key
btree (suffix: _f).
Even though the ingestion is over, since our merge operations happen
asynchronously, merging continues and eventually merges all mergable
disk components according to the described merge policy.
------------------------------------------
AQLs for the ingestion experiment
------------------------------------------
drop dataverse STBench if exists;
create dataverse STBench;
use dataverse STBench;
create type FsqCheckinTweetType as closed {
id: int64,
user_id: int64,
user_followers_count: int64,
text: string,
datetime: datetime,
coordinates: point,
url: string?
};
create dataset FsqCheckinTweet (FsqCheckinTweetType) primary key id;
/* this index type is only available in the kisskys/hilbertbtree
branch; however, you can easily replace the sif index with an inverted
keyword index on the text field and you will see similar behavior */
create index sifCoordinate on FsqCheckinTweet(coordinates) type
sif(-180.0, -90.0, 180.0, 90.0);
/* create feed */
create feed TweetFeed
using file_feed
(("fs"="localfs"),
("path"="127.0.0.1:////Users/kisskys/Data/SynFsqCheckinTweet.adm"),
("format"="adm"),
("type-name"="FsqCheckinTweetType"),
("tuple-interval"="0"));
/* connect feed */
use dataverse STBench;
set wait-for-completion-feed "true";
connect feed TweetFeed to dataset FsqCheckinTweet;
Attachments:
storage-layout.txt 574 KB