Hi,

I performed it directly, without JMH.
A few additional notes:
- The version I used is 0.11.0-SNAPSHOT on the master branch. The specific
  commit does not matter, because we have not modified the query engine much
  since 0.10.0.
- Each time I executed the query, I restarted IoTDB to avoid the influence of
  the query cache.
- Another factor that accelerates the query is that the number of TsFiles
  decreases, i.e., fewer TsFileMetadata need to be read.

Thanks,
--
Jialin Qiao
School of Software, Tsinghua University

乔嘉林
清华大学 软件学院

> -----Original Message-----
> From: "Julian Feinauer" <[email protected]>
> Sent: 2020-07-09 15:39:57 (Thursday)
> To: "[email protected]" <[email protected]>
> Cc:
> Subject: Re: [Experiment sharing] How chunk size (number of points) impacts the query performance
>
> Hey,
>
> very interesting experiment.
> Did you use something like JMH for the benchmark, or did you perform it
> directly?
>
> Julian
>
> On 09.07.20, 09:29, "孙泽嵩" <[email protected]> wrote:
>
> Hi Jialin,
>
> Great experiment! Thanks for sharing.
>
> Looking forward to the hot compaction feature.
>
> Best,
> -----------------------------------
> Zesong Sun
> School of Software, Tsinghua University
>
> 孙泽嵩
> 清华大学 软件学院
>
> > On July 8, 2020, at 16:39, Jialin Qiao <[email protected]> wrote:
> >
> > Hi,
> >
> > I'd like to share with you some experiment results about how chunk size
> > impacts the query performance.
> >
> > Hardware:
> > MacBook Pro (Retina, 15-inch, Mid 2015)
> > CPU: 2.2 GHz Intel Core i7
> > Memory: 16 GB 1600 MHz DDR3
> > I used a mobile HDD (SEAGATE, 1 TB, Model SRD00F1) as the storage.
> >
> > Workload: 1 storage group, 1 device, 100 measurements of long type, with
> > 1 million data points generated randomly for each time series.
> >
> > Some background: the initially flushed chunk size =
> > memtable_size_threshold / number of series / bytes per data point
> > (16 for long data points).
> >
> > I adjusted the memtable_size_threshold to control the chunk size.
> >
> > Configurations of IoTDB:
> >
> > enable_parameter_adapter=false
> > avg_series_point_number_threshold=10000000 (to make the
> > memtable_size_threshold valid)
> > page_size_in_byte=1000000000 (each chunk has one page)
> > tsfile_size_threshold = memtable_size_threshold =
> > 160000/1600000/16000000/160000000/1600000000
> >
> > I used SessionExample.insertTablet to insert data under the different
> > configurations, which produced chunk sizes from 100 to 1000000 points.
> >
> > Then I used SessionExample.queryByIterator to iterate the result set of
> > "select s1 from root.sg1.d1" without constructing other data structures.
> >
> > The results are:
> >
> > | chunk size | query time cost (ms) |
> > | 100        | 47620                |
> > | 1000       | 13984                |
> > | 10000      | 2416                 |
> > | 100000     | 1322                 |
> >
> > As we can see, the chunk size has a dominant impact on raw data query
> > performance. In the current query engine, the Chunk is the basic data
> > unit read from disk, and reading each Chunk requires one seek plus one
> > I/O operation. A larger chunk size therefore means fewer Chunks to read.
> >
> > Therefore, it is better to enlarge memtable_size_threshold to accelerate
> > queries. However, enlarging memtable_size_threshold requires more memory,
> > which is not always available in some scenarios. That is why we need
> > compaction, either hot compaction triggered during flushing or a timed
> > compaction strategy, to merge small chunks into large ones.
> >
> > Thanks,
> > --
> > Jialin Qiao
> > School of Software, Tsinghua University
> >
> > 乔嘉林
> > 清华大学 软件学院
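
For anyone redoing the arithmetic, the chunk sizes in the table follow
directly from the formula quoted above. Here is a small sketch that derives
them, assuming 100 series and 16 bytes per long data point (presumably an
8-byte timestamp plus an 8-byte value); it is only an illustration, not part
of the benchmark code.

// Sketch: derive the chunk sizes in the table from the configured thresholds.
public class ChunkSizeCalc {
  public static void main(String[] args) {
    long seriesNumber = 100;   // 100 measurements under one device
    long bytesPerPoint = 16;   // per long data point, as stated in the mail
    long[] thresholds = {
        160_000L, 1_600_000L, 16_000_000L, 160_000_000L, 1_600_000_000L};
    for (long memtableSizeThreshold : thresholds) {
      long chunkSize = memtableSizeThreshold / seriesNumber / bytesPerPoint;
      System.out.println("memtable_size_threshold=" + memtableSizeThreshold
          + " -> about " + chunkSize + " points per chunk");
    }
  }
}

This prints the chunk sizes 100, 1000, 10000, 100000, and 1000000 that the
different configurations produced.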
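
If you want to reproduce the write workload, below is a minimal sketch in the
spirit of SessionExample.insertTablet. It is not the exact benchmark code:
the host, port, batch size, and reliance on schema auto-creation are my
assumptions, and the Tablet API shown (addTimestamp/addValue) follows the
0.12/0.13-era Session examples, so it may differ slightly from
0.11.0-SNAPSHOT.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.iotdb.session.Session;
import org.apache.iotdb.tsfile.file.metadata.enums.TSDataType;
import org.apache.iotdb.tsfile.write.record.Tablet;
import org.apache.iotdb.tsfile.write.schema.MeasurementSchema;

public class InsertWorkloadSketch {
  public static void main(String[] args) throws Exception {
    Session session = new Session("127.0.0.1", 6667, "root", "root");
    session.open();

    // One device with 100 long-typed measurements, as in the experiment.
    // Assumes schema auto-creation is enabled (the default) or that the
    // timeseries were created beforehand.
    List<MeasurementSchema> schemas = new ArrayList<>();
    for (int i = 1; i <= 100; i++) {
      schemas.add(new MeasurementSchema("s" + i, TSDataType.INT64));
    }
    Tablet tablet = new Tablet("root.sg1.d1", schemas, 1000);

    Random random = new Random();
    long pointsPerSeries = 1_000_000L; // 1 million random points per series
    for (long time = 0; time < pointsPerSeries; time++) {
      int row = tablet.rowSize++;
      tablet.addTimestamp(row, time);
      for (int i = 1; i <= 100; i++) {
        tablet.addValue("s" + i, row, random.nextLong());
      }
      if (tablet.rowSize == tablet.getMaxRowNumber()) {
        session.insertTablet(tablet, true); // rows are already sorted by time
        tablet.reset();
      }
    }
    if (tablet.rowSize != 0) {
      session.insertTablet(tablet, true);
      tablet.reset();
    }
    session.close();
  }
}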
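
The query side, in the spirit of SessionExample.queryByIterator, simply
iterates the result set of "select s1 from root.sg1.d1" without building any
other data structures. The sketch below uses the plain hasNext/next loop
rather than the DataIterator variant, which should be equivalent for a rough
timing; the timing wrapper is illustrative, and the server is assumed to have
been restarted before each run, as described at the top of this mail.

import org.apache.iotdb.session.Session;
import org.apache.iotdb.session.SessionDataSet;

public class QuerySketch {
  public static void main(String[] args) throws Exception {
    Session session = new Session("127.0.0.1", 6667, "root", "root");
    session.open();

    long start = System.currentTimeMillis();
    SessionDataSet dataSet =
        session.executeQueryStatement("select s1 from root.sg1.d1");
    long count = 0;
    while (dataSet.hasNext()) {
      dataSet.next(); // consume the row without keeping it
      count++;
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("iterated " + count + " points in " + elapsed + " ms");

    dataSet.closeOperationHandle();
    session.close();
  }
}

With a loop like this, the time is dominated by how many Chunks have to be
fetched from disk, which matches the observation above that larger chunks
mean fewer seek + I/O operations and therefore faster raw data queries.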
