Hey,

Very interesting experiment.
Did you use something like JMH for the benchmark, or did you measure it directly?

Julian

On 09.07.20, 09:29, "孙泽嵩" <[email protected]> wrote:

    Hi Jialin,

    Great experiment! Thanks for sharing.

    Looking forward to the hot compaction feature.


    Best,
    -----------------------------------
    Zesong Sun
    School of Software, Tsinghua University

    孙泽嵩
    清华大学 软件学院

    > On Jul 8, 2020, at 16:39, Jialin Qiao <[email protected]> wrote:
    > 
    > Hi,
    > 
    > 
    > I'd like to share some experiment results on how chunk size impacts query performance.
    > 
    > 
    > Hardware: 
    > MacBook Pro (Retina, 15-inch, Mid 2015)
    > CPU: 2.2 GHz Intel Core i7
    > Memory: 16 GB 1600 MHz DDR3
    > I use a portable HDD (SEAGATE, 1 TB, Model SRD00F1) as storage.
    > 
    > 
    > Workload: 1 storage group, 1 device, 100 measurements of long type, with 1 million randomly generated data points per time series.
    > 
    > 
    > Some background: the size of a flushed chunk (in points) = memtable_size_threshold / number of series / bytes per data point (16 for long data points).
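    > 
    > For example, with 100 series and 16 bytes per long data point, memtable_size_threshold = 160000 flushes chunks of 160000 / 100 / 16 = 100 points, while memtable_size_threshold = 1600000000 flushes chunks of 1000000 points.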
    > 
    > 
    > I adjust the memtable_size_threshold to control the chunk size.
    > 
    > 
    > Configurations of IoTDB:
    > 
    > 
    > enable_parameter_adapter=false
    > avg_series_point_number_threshold=10000000 (so that memtable_size_threshold is the effective limit)
    > page_size_in_byte=1000000000 (each chunk has one page)
    > tsfile_size_threshold = memtable_size_threshold = 160000 / 1600000 / 16000000 / 160000000 / 1600000000
    > 
    > 
    > I use SessionExample.insertTablet to insert data under the different configurations, which produces chunk sizes from 100 to 1000000 points.
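    > 
    > For reference, a minimal sketch of the insert side (roughly what SessionExample.insertTablet does for this workload; the class name, host/port, and credentials are placeholders, and Tablet field/method names differ slightly across IoTDB versions, so treat them as illustrative):
    > 
    > import org.apache.iotdb.session.Session;
    > import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
    > import org.apache.iotdb.tsfile.write.record.Tablet;
    > import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;
    > 
    > import java.util.ArrayList;
    > import java.util.List;
    > import java.util.Random;
    > 
    > public class InsertTabletSketch {
    >   public static void main(String[] args) throws Exception {
    >     Session session = new Session("127.0.0.1", 6667, "root", "root");
    >     session.open();
    > 
    >     // 1 device with 100 long-type measurements (s1 ... s100)
    >     List<MeasurementSchema> schemas = new ArrayList<>();
    >     for (int i = 1; i <= 100; i++) {
    >       schemas.add(new MeasurementSchema("s" + i, TSDataType.INT64));
    >     }
    >     Tablet tablet = new Tablet("root.sg1.d1", schemas, 1000);
    > 
    >     // 1 million randomly generated points per time series
    >     Random random = new Random();
    >     for (long time = 0; time < 1_000_000; time++) {
    >       int row = tablet.rowSize++;
    >       tablet.timestamps[row] = time;
    >       for (int i = 0; i < 100; i++) {
    >         ((long[]) tablet.values[i])[row] = random.nextLong();
    >       }
    >       // send a full tablet and reuse it
    >       if (tablet.rowSize == tablet.getMaxRowNumber()) {
    >         session.insertTablet(tablet);
    >         tablet.reset();
    >       }
    >     }
    >     if (tablet.rowSize != 0) {
    >       session.insertTablet(tablet);
    >       tablet.reset();
    >     }
    >     session.close();
    >   }
    > }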
    > 
    > 
    > Then I use SessionExample.queryByIterator to iterate the result set of "select s1 from root.sg1.d1" without constructing other data structures.
    > 
    > 
    > The results are:
    > 
    > 
    > | chunk size (points) | query time (ms) |
    > |                 100 |           47620 |
    > |                1000 |           13984 |
    > |               10000 |            2416 |
    > |              100000 |            1322 |
    > 
    > 
    > As we can see, the chunk size has a dominant impact on raw data query performance. In the current query engine, the Chunk is the basic unit of data read from disk. Reading each Chunk costs one seek plus one IO operation, so a larger chunk size means fewer Chunks to read.
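    > 
    > To put numbers on it: with 1 million points in the series, a chunk size of 100 points means about 10000 chunks (so roughly 10000 seeks + reads) to scan s1, while a chunk size of 100000 points means only about 10 chunks.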
    > 
    > 
    > Therefore, it's better to enlarge memtable_size_threshold to accelerate queries. However, a larger memtable_size_threshold requires more memory, which is not always available in some scenarios. Therefore, we need compaction, either hot compaction triggered during flushing or a timed compaction strategy, to merge small chunks into large ones.
    > 
    > 
    > Thanks,
    > --
    > Jialin Qiao
    > School of Software, Tsinghua University
    > 
    > 乔嘉林
    > 清华大学 软件学院

