Tian Jiang created IOTDB-1889:
---------------------------------
Summary: Adaptive TimeArray to reduce timestamp memory footprint
Key: IOTDB-1889
URL: https://issues.apache.org/jira/browse/IOTDB-1889
Project: Apache IoTDB
Issue Type: Improvement
Components: Core/Engine
Reporter: Tian Jiang
We use long arrays (64-bit integers) to store timestamps of timeseries points
in memory. Although longs may fit almost all situations, they may be
overfitting for general situations as consecutive timestamps usually have a
small difference, which is around the sampling interval (a few microseconds to
a few seconds). So it is straightforward that by only recording the differences
of adjacent timestamps, we can reduce the memory footprint of MemTables
greatly. For example, for double-typed timeseries, each time-value pair takes
8+8=16 bytes, and if we can switch the datatype of timestamps to short (int16),
6 bytes will be saved for each point, and it will result in a memory saving of
37.5%.
The idea is intuitive, we record the full timestamp of the first point, and
only record the differences of adjacent points in an integer array. Just like
the current implementation, the timestamps are batched with fixed-length
arrays, so for each such array, a full timestamp is maintained. Int64 is used
for the first array, and during the construction of an array, we keep track of
the largest difference, and if it can be recorded by a smaller integer, then we
use a smaller integer for the next array. For example, assuming the length of
each array is 4 and we have already written 5 timestamps: t0, t0+800, t0+1530,
t0+2430, then we will record t0 outside the array, and 0, 800, 730, 900 in the
array (we record a zero in the head to unify the access of timestamps, i.e., do
not use if-else for the first timestamp). Next, we find that the maximum
difference is 900, which can be represented by an int16, so we decide to use
int16 for the next array. On the other side, when overflow occurs, we use a
larger integer array and copy all values to the new array.
However, this optimization introduces two side-effects: first, as the full
timestamp is not stored anymore, we must traverse from the first position of
the array to calculate the true timestamps. Considering that such in-memory
data points do not support methods of visiting other than traverse, this
side-effect is not a big deal.
The second one is more threatening, we cannot sort the timestamps as previously
for the same reason. The simplest way is to switch to the previous method
(using int64 to store full timestamps) as soon as we detect any disorders. The
second method is more complicated: since sorting algorithms basically consist
of comparisons and swaps, to support comparison, we can store the differences
between the first timestamps and other timestamps instead of the differences
between two consecutive timestamps, so we can restore each timestamp in O(1)
time and comparison is also affordable. As for swaps, if the first position is
swapped, all other positions should be updated, otherwise, only the swapped
position needs updating.
We call the method "adaptive" for two reasons: first, the type of integers may
change as the differences of timestamps become larger or smaller; second, since
storing differences adding the cost of sorting when there are disordered data,
it should be disabled when the increased sorting cost overwhelms the benefit it
brings.
Despite all, the method has great potential in situations where data comes in
at a high frequency (so timestamp differences are small) and without any
disorders, and it is totally compatible with the previous method.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)