Tian Jiang created IOTDB-2189:
---------------------------------

             Summary: Shared chunks to reduce I/Os with massive timeseries
                 Key: IOTDB-2189
                 URL: https://issues.apache.org/jira/browse/IOTDB-2189
             Project: Apache IoTDB
          Issue Type: Improvement
          Components: Core/Engine, Core/TsFile
            Reporter: Tian Jiang
         Attachments: image-2021-12-22-12-03-03-966.png

When the number of timeseries explodes, the average memory available for each 
series is very limited.
For example, with 10 million timeseries, storing 100 points for each series 
results in 1 billion points in memory. If each point has an 8-byte timestamp 
and an 8-byte value, the memory footprint will be 16 GB. In this case, each 
timeseries will generate a chunk of only 100 points, which is less than 1 KB 
(after encoding) when flushed to disk.
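
The arithmetic behind this example can be checked with a few lines of Python. 
This is only a back-of-the-envelope sketch: the constant names are made up for 
illustration, GB is taken as 10^9 bytes, and the encoded chunk size is not 
computed because it depends on the encoding used.

```python
# Figures taken from the example in this issue.
NUM_SERIES = 10_000_000        # 10 million timeseries
POINTS_PER_SERIES = 100        # points buffered per series
BYTES_PER_POINT = 8 + 8        # 8-byte timestamp + 8-byte value

total_points = NUM_SERIES * POINTS_PER_SERIES
total_bytes = total_points * BYTES_PER_POINT
raw_chunk_bytes = POINTS_PER_SERIES * BYTES_PER_POINT  # before encoding

print(f"points in memory : {total_points:,}")
print(f"memory footprint : {total_bytes / 10**9:.0f} GB")
print(f"raw chunk size   : {raw_chunk_bytes} bytes per series")
```

Even before encoding, each chunk holds only 1600 bytes of raw data, which is 
far smaller than a typical disk read.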

As a chunk is the I/O unit during queries, such extremely small chunks will 
significantly reduce I/O performance. Moreover, as the number of points is 
small, some encoding algorithms may not work very well. Compaction may solve 
these problems to some extent, but compaction itself also suffers from small 
chunks.

We notice that timeseries are generally queried together. For example, device 
queries read all timeseries of one or more devices, and compactions also read 
timeseries in batches. So, if we encapsulate more than one timeseries in a 
chunk, the chunk can be much larger and I/O efficiency is greatly improved. 
Moreover, the enlarged chunk size may also improve the compression ratio.

 !image-2021-12-22-12-03-03-966.png! 
The figure above shows the proposed change. When 3 timeseries are put into the 
same chunk, a single I/O for timeseries0 fetches all of them. Once the chunk is 
cached, the other two timeseries can be read from it, so additional I/O is 
avoided.
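
The effect of sharing a chunk can be illustrated with a toy simulation. This is 
a minimal sketch, not IoTDB code: `ChunkReader`, `count_ios`, and the chunk ids 
are hypothetical names, and a "disk read" is just a counter increment.

```python
class ChunkReader:
    """Toy reader that counts simulated disk reads (hypothetical, not the IoTDB API)."""

    def __init__(self, series_to_chunk):
        self.series_to_chunk = series_to_chunk  # series name -> chunk id
        self.cache = {}                          # chunk id -> loaded chunk
        self.io_count = 0

    def read(self, series):
        chunk_id = self.series_to_chunk[series]
        if chunk_id not in self.cache:
            self.io_count += 1                   # cache miss: one simulated I/O
            self.cache[chunk_id] = f"data of {chunk_id}"
        return self.cache[chunk_id]


def count_ios(layout, queried):
    """Read every queried series once and return the number of I/Os issued."""
    reader = ChunkReader(layout)
    for series in queried:
        reader.read(series)
    return reader.io_count


series = ["timeseries0", "timeseries1", "timeseries2"]
separate = {s: f"chunk-{i}" for i, s in enumerate(series)}  # one chunk per series
shared = {s: "chunk-0" for s in series}                     # all three in one chunk

print("separate chunks:", count_ios(separate, series), "I/Os")
print("shared chunk   :", count_ios(shared, series), "I/Os")
```

With one chunk per series the three reads cost three I/Os; with a shared chunk 
the first read loads it and the other two are served from the cache.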

The disadvantage is also obvious. If only some of the timeseries in a chunk are 
queried, the bandwidth spent reading the others is wasted. So the point is to 
choose wisely which timeseries should be grouped together and which should not. 
One option is to simply group timeseries of the same device, provided 
whole-device queries are very common. A more sophisticated method could be 
based on statistics or even machine learning. The method can also be dynamic, 
as it only affects newly generated chunks.
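
The simple per-device grouping, and the waste it can cause, might be sketched 
as follows. The assumptions here are mine: a timeseries path is treated as 
"device path + '.' + measurement name", all series in a chunk are assumed to 
be of equal size, and `group_by_device` and `wasted_fraction` are hypothetical 
helpers rather than existing IoTDB functions.

```python
from collections import defaultdict


def group_by_device(paths):
    """Group timeseries paths by device, assuming the last path node is the measurement."""
    groups = defaultdict(list)
    for path in paths:
        device, _, _measurement = path.rpartition(".")
        groups[device].append(path)
    return dict(groups)


def wasted_fraction(chunk_series, queried):
    """Fraction of a shared chunk that is read but not needed by the query.

    Assumes every series occupies the same share of the chunk.
    """
    unused = [s for s in chunk_series if s not in queried]
    return len(unused) / len(chunk_series)


paths = ["root.sg.d1.s1", "root.sg.d1.s2", "root.sg.d2.s1", "root.sg.d1.s3"]
groups = group_by_device(paths)
d1_chunk = groups["root.sg.d1"]

print(groups)
print(wasted_fraction(d1_chunk, set(d1_chunk)))       # whole-device query
print(wasted_fraction(d1_chunk, {"root.sg.d1.s1"}))   # single-series query
```

A whole-device query wastes nothing, while a query for a single series out of 
three reads two thirds of the chunk for no benefit, which is why the grouping 
policy matters.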



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
