Murtadha Hubail created ASTERIXDB-1337:
------------------------------------------
Summary: Dataset Memory Management on Multi-Partition NC
Key: ASTERIXDB-1337
URL: https://issues.apache.org/jira/browse/ASTERIXDB-1337
Project: Apache AsterixDB
Issue Type: Improvement
Components: AsterixDB, Storage
Reporter: Murtadha Hubail
Priority: Minor
Currently, each dataset has a fixed memory budget - its total virtual buffer
cache (VBC) budget - which is configurable via the following attributes in the
asterix configuration file:
storage.memorycomponent.pagesize (Default 128K)
storage.memorycomponent.numpages (Default 256 pages)
Note: different attributes are used for the Metadata datasets.
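To make the budget concrete, here is a minimal sketch (hypothetical helper, not actual AsterixDB code; the class and method names are illustrative) of how the two attributes above determine a dataset's total VBC budget:

```java
// Sketch: computing a dataset's total VBC budget from the two
// configuration attributes described above (not actual AsterixDB code).
public class DatasetBudget {
    // Default values from the asterix configuration file.
    static final int PAGE_SIZE = 128 * 1024; // storage.memorycomponent.pagesize
    static final int NUM_PAGES = 256;        // storage.memorycomponent.numpages

    // Total VBC budget in bytes = page size * number of pages.
    static long totalVbcBudget(int pageSize, int numPages) {
        return (long) pageSize * numPages;
    }

    public static void main(String[] args) {
        // With the defaults: 128K * 256 pages = 32MB per dataset.
        System.out.println(totalVbcBudget(PAGE_SIZE, NUM_PAGES));
    }
}
```

With the defaults this comes to 32MB of in-memory component budget per dataset per node.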
During query compilation, any index that will be accessed uses
AbstractLSMIndexDataflowHelperFactory which is passed an instance of
AsterixVirtualBufferCacheProvider.
Each dataset has a single AsterixVirtualBufferCacheProvider, so all of the
dataset's indexes and their partitions (on different IO devices) on the same
node access the same dataset VBC.
During runtime, when the AbstractLSMIndexDataflowHelperFactory is used to
create the actual IndexDataflowHelper, the dataset VBC is initialized. The
total VBC budget of the dataset is divided into a number of VBCs, which is
configurable in the asterix configuration file as:
storage.memorycomponent.numcomponents (Default 2 VBCs)
Each one of those VBCs is created as an object of type
MultitenantVirtualBufferCache (MVBC) (in
DatasetLifecycleManager#initializeDatasetVirtualBufferCache). The size of each
of these MVBCs is (storage.memorycomponent.numpages /
storage.memorycomponent.numcomponents) pages. Even though the dataset VBCs
have been initialized, no memory is allocated yet; this avoids allocating
memory for read-only queries or bulkload DDLs.
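The division described above can be sketched as follows (a hypothetical helper, not the actual DatasetLifecycleManager code):

```java
// Sketch: splitting the dataset's page budget across the configured
// number of memory components (illustrative, not actual AsterixDB code).
public class MvbcSizing {
    // storage.memorycomponent.numpages / storage.memorycomponent.numcomponents
    static int pagesPerMvbc(int numPages, int numComponents) {
        return numPages / numComponents;
    }

    public static void main(String[] args) {
        // Defaults: 256 pages split across 2 MVBCs = 128 pages each.
        System.out.println(pagesPerMvbc(256, 2));
    }
}
```

So with the defaults, each of the two MVBCs is limited to 128 pages of 128K each, i.e. 16MB.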
Upon the first modification (in LSMHarness#modify) of any index partition that
belongs to this dataset, we allocate the memory of all MVBCs that were
initialized earlier. As a result, all of the dataset's indexes and their
partitions on the same node compete for the budget of a single MVBC at a time.
Once an MVBC is full, all files opened in it are scheduled to be flushed, and
we switch to another MVBC (if one is available). The effect is frequent
flushes of many small files, which in turn lead to frequent merges.
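The fill/flush/switch behavior can be modeled with a simplified sketch (a hypothetical toy model of the behavior described above, not the actual MultitenantVirtualBufferCache or LSMHarness code):

```java
import java.util.List;

// Toy model of the behavior above: all writers share one active MVBC;
// when it fills, flushes are scheduled for the files opened in it and
// writing switches to the next MVBC (illustrative, not AsterixDB code).
public class MvbcSwitcher {
    static class Mvbc {
        final int capacityPages;
        int usedPages;
        Mvbc(int capacityPages) { this.capacityPages = capacityPages; }
        boolean isFull() { return usedPages >= capacityPages; }
    }

    final List<Mvbc> mvbcs;
    int active = 0;          // index of the MVBC currently taking writes
    int flushesScheduled = 0; // count of flush rounds triggered so far

    MvbcSwitcher(List<Mvbc> mvbcs) { this.mvbcs = mvbcs; }

    // Called on every modification; charges pages to the active MVBC
    // and switches when it becomes full.
    void modify(int pages) {
        Mvbc current = mvbcs.get(active);
        current.usedPages += pages;
        if (current.isFull()) {
            // Schedule flushes for all files in this MVBC, then switch.
            flushesScheduled++;
            active = (active + 1) % mvbcs.size();
        }
    }
}
```

Because every index partition on the node charges the same active MVBC, each one contributes only a small fraction of the pages in a flushed component, which is why the flushed files are small.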
I think it would be better if each partition (IO device on the node) had its
own MVBC budget.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)