Re: StorageGroupProcessor.sequenceFileList is ordered by fileName rather than dataTime

Xiangdong Huang Tue, 10 Dec 2019 07:36:45 -0800

Hi,

>  so I think it should be dumped into iotdb as an unseq file and sorted in
memory with the original files.


I do not think so. Putting the file into the unseq folder will decrease the
query speed (at least in the current implementation, as far as I know).
In my opinion, if a part of the data (note that I am not saying a file; I
mean a part of the data) can be considered ordered data (i.e.,
sequence data), putting it into the seq folder may be friendlier for queries.
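
To make the "time hole" condition from the quoted discussion concrete, here is a minimal sketch, not IoTDB code. It assumes each file can be described by a per-device closed time range; the class `SeqInsertPosition` and all of its methods are hypothetical names invented for this illustration. It looks for a single position in the ordered sequence files where a loaded file fits for every device, and shows that splitting the loaded file by device lets each part find its own hole.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SeqInsertPosition {

    /**
     * Returns the index at which newFile could be inserted into the ordered
     * list seqFiles without breaking any device's time order, or -1 if no
     * single position works for all devices. A file is modeled as a map from
     * device name to a closed time range {startTime, endTime}.
     */
    static int findInsertPosition(List<Map<String, long[]>> seqFiles,
                                  Map<String, long[]> newFile) {
        for (int pos = 0; pos <= seqFiles.size(); pos++) {
            if (fitsAt(seqFiles, newFile, pos)) {
                return pos;
            }
        }
        return -1;
    }

    /** True if, for every device, newFile lies after files [0, pos) and before files [pos, n). */
    static boolean fitsAt(List<Map<String, long[]>> seqFiles,
                          Map<String, long[]> newFile, int pos) {
        for (Map.Entry<String, long[]> entry : newFile.entrySet()) {
            long[] range = entry.getValue();
            for (int i = 0; i < seqFiles.size(); i++) {
                long[] other = seqFiles.get(i).get(entry.getKey());
                if (other == null) {
                    continue; // this file holds no data for the device
                }
                if (i < pos && other[1] >= range[0]) {
                    return false; // an earlier file ends at or after our start
                }
                if (i >= pos && other[0] <= range[1]) {
                    return false; // a later file starts at or before our end
                }
            }
        }
        return true;
    }

    /** Describes a file holding data for device1 and device2 (illustrative helper). */
    static Map<String, long[]> file(long d1Start, long d1End, long d2Start, long d2End) {
        Map<String, long[]> ranges = new HashMap<>();
        ranges.put("device1", new long[] {d1Start, d1End});
        ranges.put("device2", new long[] {d2Start, d2End});
        return ranges;
    }

    /** Keeps only one device of a file, i.e. one part of a split file. */
    static Map<String, long[]> onlyDevice(Map<String, long[]> source, String device) {
        Map<String, long[]> part = new HashMap<>();
        part.put(device, source.get(device));
        return part;
    }

    /** F1, F2, F3 with a hole after F2 for device1 and after F1 for device2 (made-up times). */
    static List<Map<String, long[]>> exampleSeqFiles() {
        List<Map<String, long[]>> files = new ArrayList<>();
        files.add(file(1, 10, 1, 10));   // F1
        files.add(file(11, 20, 31, 40)); // F2
        files.add(file(41, 50, 41, 50)); // F3
        return files;
    }

    public static void main(String[] args) {
        List<Map<String, long[]>> seq = exampleSeqFiles();
        Map<String, long[]> loaded = file(21, 30, 11, 20);
        System.out.println(findInsertPosition(seq, loaded));                        // -1
        System.out.println(findInsertPosition(seq, onlyDevice(loaded, "device1"))); // 2
        System.out.println(findInsertPosition(seq, onlyDevice(loaded, "device2"))); // 1
    }
}
```

With these made-up time ranges, the whole file fits nowhere, while the device1 part fits between F2 and F3 and the device2 part fits between F1 and F2.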

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

 黄向东
清华大学 软件学院


On Tue, Dec 10, 2019 at 11:16 PM, atoiLiu <[email protected]> wrote:

> Hi,
>
> I think the semantics of load are the same as those of insert, except that
> what is inserted is a sealed file, so I think it should be dumped into iotdb
> as an unseq file and sorted in memory with the original files.
>
> This may make queries very slow, but we could prompt the user to run a
> merge command?
>
> > On Dec 10, 2019, at 9:04 PM, Xiangdong Huang <[email protected]> wrote:
> >
> > Hi,
> >
> > I think it is a bug in the `load` function now, and needs to be fixed
> > quickly.
> >
> > Firstly, let's consider the case where there is no `load` function.
> > In this case, the files will have the same order no matter which
> > device's timeline you use as the ordering dimension.
> >
> > (Second, in your case, can we put the tsfile 105 into the sequence files?
> > Condition: all devices in a flushing memtable fit into a time hole of
> > the sequence files.)
> >
> > Third, let's consider the case where the `load` function is enabled.
> >
> > The worst case is that you add a file which has two devices (device1 and
> > device2), and if you use device1's timeline to order files, it is between
> > F2 and F3, while it is between F1 and F2 if you use device2's timeline.
> >
> > device1: F1   F2   _HOLE__ F3
> > device2: F1  __HOLE__ F2  F3
> >
> > Then, why not split the file into two files?
> >
> > Best,
> > -----------------------------------
> > Xiangdong Huang
> > School of Software, Tsinghua University
> >
> > 黄向东
> > 清华大学 软件学院
> >
> >
> > On Tue, Dec 10, 2019 at 7:05 PM, Jialin Qiao <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> Things become complicated when the load file feature is introduced in
> >> IoTDB. The newly added data file may contain many devices with different
> >> time intervals. Therefore, one order of TsFileResources is insufficient.
> >> A possible solution is to sort the TsFileResources temporarily when
> >> querying.
> >>
> >> Thanks,
> >> Jialin Qiao
> >>
> >> On Mon, Dec 9, 2019 at 12:14 AM, Lei Rui (Jira) <[email protected]> wrote:
> >>
> >>> Lei Rui created IOTDB-346:
> >>> -----------------------------
> >>>
> >>>             Summary: StorageGroupProcessor.sequenceFileList is ordered by
> >>> fileName rather than dataTime
> >>>                 Key: IOTDB-346
> >>>                 URL: https://issues.apache.org/jira/browse/IOTDB-346
> >>>             Project: Apache IoTDB
> >>>          Issue Type: Bug
> >>>            Reporter: Lei Rui
> >>>
> >>>
> >>> `StorageGroupProcessor.sequenceFileList` is ordered by fileName rather
> >>> than by time of data, as reflected in the
> >>> `StorageGroupProcessor.getAllFiles` method code:
> >>> {code:java}
> >>> tsFiles.sort(this::compareFileName);
> >>> {code}
> >>> ----
> >>> I use the following examples to expose the bug when the order of fileName
> >>> is inconsistent with that of dataTime.
> >>>
> >>> First, for preparation, I created three tsfiles using the following sql:
> >>> {code:java}
> >>> SET STORAGE GROUP TO root.ln.wf01.wt01
> >>> CREATE TIMESERIES root.ln.wf01.wt01.status WITH DATATYPE=BOOLEAN,
> >>> ENCODING=PLAIN
> >>> CREATE TIMESERIES root.ln.wf01.wt01.temperature WITH DATATYPE=DOUBLE,
> >>> ENCODING=PLAIN
> >>> CREATE TIMESERIES root.ln.wf01.wt01.hardware WITH DATATYPE=INT32,
> >>> ENCODING=PLAIN
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(1, 1.1, false, 11)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(2, 2.2, true, 22)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(3, 3.3, false, 33)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(4, 4.4, false, 44)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(5, 5.5, false, 55)
> >>> flush
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(100, 100.1, false, 110)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(150, 200.2, true, 220)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(200, 300.3, false, 330)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(250, 400.4, false, 440)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(300, 500.5, false, 550)
> >>> flush
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(10, 10.1, false, 110)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(20, 20.2, true, 220)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(30, 30.3, false, 330)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(40, 40.4, false, 440)
> >>> INSERT INTO root.ln.wf01.wt01(timestamp,temperature,status, hardware)
> >>> values(50, 50.5, false, 550)
> >>> flush
> >>> {code}
> >>> The tsfiles created are organized in the following directory structure:
> >>> {code:java}
> >>> |data
> >>> |--sequence
> >>> |----root.ln.wf01.wt01
> >>> |------1575813520203-101-0.tsfile
> >>> |------1575813520203-101-0.tsfile.resource
> >>> |------1575813520669-103-0.tsfile
> >>> |------1575813520669-103-0.tsfile.resource
> >>> |--unsequence
> >>> |----root.ln.wf01.wt01
> >>> |------1575813521063-105-0.tsfile
> >>> |------1575813521063-105-0.tsfile.resource
> >>> {code}
> >>> ||File Name||Data Time||
> >>> |(a) 1575813520203-101-0.tsfile|1-5|
> >>> |(c) 1575813521063-105-0.tsfile|10-50|
> >>> |(b) 1575813520669-103-0.tsfile|100-300|
> >>>
> >>> Note how the order of fileName is inconsistent with that of dataTime.
> >>>
> >>> By the way, if you look into the code, you will know how the file name is
> >>> generated:
> >>> {code:java}
> >>> System.currentTimeMillis() + IoTDBConstant.TSFILE_NAME_SEPARATOR +
> >>> versionController.nextVersion() + IoTDBConstant.TSFILE_NAME_SEPARATOR + "0"
> >>> + TSFILE_SUFFIX
> >>> {code}
> >>> ----
> >>> Then, I loaded the three tsfiles into another brand new IoTDB. I did two
> >>> experiments, each with a different loading order.
> >>>
> >>> In the first experiment, the tsfiles were loaded in their data time order.
> >>> That is,
> >>> {code:java}
> >>> IoTDB> load 1575813520203-101-0.tsfile // tsfile (a), with data time 1-5
> >>> IoTDB> load 1575813521063-105-0.tsfile // tsfile (c), with data time 10-50
> >>> IoTDB> load 1575813520669-103-0.tsfile // tsfile (b), with data time 100-300
> >>> {code}
> >>> After loading successfully, I did the following query in the same client
> >>> window and got the wrong result:
> >>> {code:java}
> >>> IoTDB> select * from root
> >>> +-----------------------------------+-----------------------------+-----------------------------+-----------------------------+
> >>> |                               Time|root.ln.wf01.wt01.temperature|     root.ln.wf01.wt01.status|   root.ln.wf01.wt01.hardware|
> >>> +-----------------------------------+-----------------------------+-----------------------------+-----------------------------+
> >>> |      1970-01-01T08:00:00.001+08:00|                          1.1|                        false|                           11|
> >>> |      1970-01-01T08:00:00.002+08:00|                          2.2|                         true|                           22|
> >>> |      1970-01-01T08:00:00.003+08:00|                          3.3|                        false|                           33|
> >>> |      1970-01-01T08:00:00.004+08:00|                          4.4|                        false|                           44|
> >>> |      1970-01-01T08:00:00.005+08:00|                          5.5|                        false|                           55|
> >>> |      1970-01-01T08:00:00.100+08:00|                        100.1|                        false|                          110|
> >>> |      1970-01-01T08:00:00.150+08:00|                        200.2|                         true|                          220|
> >>> |      1970-01-01T08:00:00.200+08:00|                        300.3|                        false|                          330|
> >>> |      1970-01-01T08:00:00.250+08:00|                        400.4|                        false|                          440|
> >>> |      1970-01-01T08:00:00.300+08:00|                        500.5|                        false|                          550|
> >>> |      1970-01-01T08:00:00.010+08:00|                         10.1|                        false|                          110|
> >>> |      1970-01-01T08:00:00.020+08:00|                         20.2|                         true|                          220|
> >>> |      1970-01-01T08:00:00.030+08:00|                         30.3|                        false|                          330|
> >>> |      1970-01-01T08:00:00.040+08:00|                         40.4|                        false|                          440|
> >>> |      1970-01-01T08:00:00.050+08:00|                         50.5|                        false|                          550|
> >>> +-----------------------------------+-----------------------------+-----------------------------+-----------------------------+
> >>> Total line number = 15
> >>> It costs 0.198s
> >>> {code}
> >>> I checked the data directory of the loaded server and it looks like this:
> >>> {code:java}
> >>> |data
> >>> |--sequence
> >>> |----root.ln.wf01.wt01
> >>> |------1575813520203-101-0.tsfile
> >>> |------1575813520203-101-0.tsfile.resource
> >>> |------1575813520669-103-0.tsfile
> >>> |------1575813520669-103-0.tsfile.resource
> >>> |------1575813521063-105-0.tsfile
> >>> |------1575813521063-105-0.tsfile.resource
> >>> |--unsequence{code}
> >>> ----
> >>> In the second experiment, the tsfiles were loaded in their file name
> >>> order. That is,
> >>> {code:java}
> >>> IoTDB> load 1575813520203-101-0.tsfile // tsfile (a), with data time 1-5
> >>> IoTDB> load 1575813520669-103-0.tsfile // tsfile (b), with data time 100-300
> >>> IoTDB> load 1575813521063-105-0.tsfile // tsfile (c), with data time 10-50
> >>> {code}
> >>> Note that I expected tsfile (c) to be loaded into the unsequence
> >>> data directory.
> >>>
> >>> After loading successfully, I did the following query in the same client
> >>> window and got the CORRECT result:
> >>> {code:java}
> >>> IoTDB> select * from root
> >>> +-----------------------------------+-----------------------------+-----------------------------+-----------------------------+
> >>> |                               Time|root.ln.wf01.wt01.temperature|     root.ln.wf01.wt01.status|   root.ln.wf01.wt01.hardware|
> >>> +-----------------------------------+-----------------------------+-----------------------------+-----------------------------+
> >>> |      1970-01-01T08:00:00.001+08:00|                          1.1|                        false|                           11|
> >>> |      1970-01-01T08:00:00.002+08:00|                          2.2|                         true|                           22|
> >>> |      1970-01-01T08:00:00.003+08:00|                          3.3|                        false|                           33|
> >>> |      1970-01-01T08:00:00.004+08:00|                          4.4|                        false|                           44|
> >>> |      1970-01-01T08:00:00.005+08:00|                          5.5|                        false|                           55|
> >>> |      1970-01-01T08:00:00.010+08:00|                         10.1|                        false|                          110|
> >>> |      1970-01-01T08:00:00.020+08:00|                         20.2|                         true|                          220|
> >>> |      1970-01-01T08:00:00.030+08:00|                         30.3|                        false|                          330|
> >>> |      1970-01-01T08:00:00.040+08:00|                         40.4|                        false|                          440|
> >>> |      1970-01-01T08:00:00.050+08:00|                         50.5|                        false|                          550|
> >>> |      1970-01-01T08:00:00.100+08:00|                        100.1|                        false|                          110|
> >>> |      1970-01-01T08:00:00.150+08:00|                        200.2|                         true|                          220|
> >>> |      1970-01-01T08:00:00.200+08:00|                        300.3|                        false|                          330|
> >>> |      1970-01-01T08:00:00.250+08:00|                        400.4|                        false|                          440|
> >>> |      1970-01-01T08:00:00.300+08:00|                        500.5|                        false|                          550|
> >>> +-----------------------------------+-----------------------------+-----------------------------+-----------------------------+
> >>> Total line number = 15
> >>> It costs 0.267s
> >>> {code}
> >>> I looked into the data directory of the loaded server and, surprisingly, it
> >>> was the same as in the first experiment. Furthermore, in the second
> >>> experiment, I restarted the server and the client and queried again. This
> >>> time, the result was wrong again, the same as in the first experiment.
> >>>
> >>> *One point of the second experiment is especially confusing*: why was
> >>> tsfile (c) not loaded as an unsequence tsfile? And why did the query
> >>> executed immediately after the three tsfiles were loaded return the CORRECT
> >>> result?
> >>>
> >>>
> >>>
> >>> --
> >>> This message was sent by Atlassian Jira
> >>> (v8.3.4#803005)
> >>>
> >>
> >>
> >> --
> >> —————————————————
> >> Jialin Qiao
> >> School of Software, Tsinghua University
> >>
> >> 乔嘉林
> >> 清华大学 软件学院
> >>
>
>
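
For reference, the ordering mismatch reported in the quoted issue can be sketched in a few lines. This is only an illustration under stated assumptions: the body of `compareFileName` is not shown in this thread, so the sketch assumes it compares the leading `System.currentTimeMillis()` part of the file name. The class `FileOrderDemo`, the type `TsFileInfo`, and the two comparators are hypothetical names, not IoTDB classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class FileOrderDemo {

    /** Minimal stand-in for a tsfile: its name and the time range of its data. */
    static final class TsFileInfo {
        final String name;
        final long startTime;
        final long endTime;

        TsFileInfo(String name, long startTime, long endTime) {
            this.name = name;
            this.startTime = startTime;
            this.endTime = endTime;
        }
    }

    /** Leading number of the file name, i.e. the creation timestamp part. */
    static long nameTimestamp(String fileName) {
        return Long.parseLong(fileName.substring(0, fileName.indexOf('-')));
    }

    // Assumed behavior of ordering by file name: compare the creation timestamps.
    static final Comparator<TsFileInfo> BY_FILE_NAME =
        Comparator.comparingLong(f -> nameTimestamp(f.name));

    // Ordering by the time of the data itself.
    static final Comparator<TsFileInfo> BY_DATA_TIME =
        Comparator.comparingLong(f -> f.startTime);

    /** Returns the file names after sorting a copy of the list with the given order. */
    static List<String> namesSortedBy(List<TsFileInfo> files, Comparator<TsFileInfo> order) {
        List<TsFileInfo> copy = new ArrayList<>(files);
        copy.sort(order);
        List<String> names = new ArrayList<>();
        for (TsFileInfo f : copy) {
            names.add(f.name);
        }
        return names;
    }

    /** The three tsfiles from the issue: (a) 1-5, (c) 10-50, (b) 100-300. */
    static List<TsFileInfo> exampleFiles() {
        List<TsFileInfo> files = new ArrayList<>();
        files.add(new TsFileInfo("1575813520203-101-0.tsfile", 1, 5));     // (a)
        files.add(new TsFileInfo("1575813521063-105-0.tsfile", 10, 50));   // (c)
        files.add(new TsFileInfo("1575813520669-103-0.tsfile", 100, 300)); // (b)
        return files;
    }

    public static void main(String[] args) {
        // File-name order puts (b) before (c), matching the wrong query result.
        System.out.println(namesSortedBy(exampleFiles(), BY_FILE_NAME));
        // Data-time order puts (c) before (b), matching the correct result.
        System.out.println(namesSortedBy(exampleFiles(), BY_DATA_TIME));
    }
}
```

Under that assumption, file-name order is (a), (b), (c), which matches the row order of the wrong query result, while data-time order is (a), (c), (b).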
