I see. The following sentence is very important to guarantee the correctness...
> At the same time, resubmitting UFP after the end of the task in each
> Flush thread ensures that all tasks can be executed.

I think this point should be highlighted, rather than drowned in the last
paragraph. Anyway, now I understand what you want to do.

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

黄向东
清华大学 软件学院


Xiangdong Huang <[email protected]> wrote on Fri, Jun 28, 2019 at 8:41 PM:

> Hi Tianan,
>
> > the flush tasks in each UFP (there is a flushing queue in UFP) need to
> > be executed sequentially.
> > Flush thread polls the first UFP from the UFP queue in FlushManager ->
> > polls the first flush task in the UFP -> completes the flush task ->
> > sets 'managedByFlushManager' of the UFP to false.
>
> This indicates that there can be more than one flush task in a UFP, but
> the FlushManager takes just one task from the UFP, then marks the UFP as
> managedByFlushManager=false and polls it out of the queue. (So, when are
> the remaining tasks flushed?)
>
> > (1) UFP is not managed by Flush Manager, i.e. 'managedByFlushManager'
> > is false
>
> Suppose a UFP has one flush task and has been enqueued. Then the UFP
> receives a second flush task. However, it cannot be added to the queue,
> because managedByFlushManager == true. Following your logic above, the
> second flush task would be ignored...
>
> > Flush Manager first determines whether UFP meets the criteria for
> > submission
>
> Do you mean that both criteria must be satisfied? (If so, the
> hypothetical situation above will occur.)
>
> > Through the above design, we can ensure that at the same time for each
> > UFP, Flush Manager will only manage at most once and execute at most
> > one flush task, while there is no restriction between different UFPs.
>
> With your design, we can ensure that at any given time the Flush Manager
> manages a given UFP at most once and executes at most one flush task
> from it. But how is the hypothetical situation above avoided?
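To make the submit/resubmit protocol under discussion concrete, here is a
minimal, self-contained Java sketch. All names (`Ufp`, `FlushManager`,
`registerUfp`, `runOneFlushStep`) are invented for illustration, not the
actual IoTDB code, and a single lock stands in for the real concurrency
control. It shows both the rejection of a second submission while the UFP
is managed, and how resubmission after each task keeps that second task
from being lost:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch only (names assumed, not IoTDB's code).
class Ufp {
    final Queue<Runnable> flushTasks = new ConcurrentLinkedQueue<>();
    volatile boolean managedByFlushManager = false;
}

class FlushManager {
    private final Queue<Ufp> ufpQueue = new ArrayDeque<>();

    // A UFP is enqueued only if BOTH criteria hold:
    // (1) it is not already managed, and (2) it has pending flush tasks.
    synchronized boolean registerUfp(Ufp ufp) {
        if (ufp.managedByFlushManager || ufp.flushTasks.isEmpty()) {
            return false;
        }
        ufp.managedByFlushManager = true;
        ufpQueue.add(ufp);
        return true;
    }

    // One step of a flush thread: take one UFP, run exactly one of its
    // tasks, clear the flag, then RESUBMIT so leftover tasks still run.
    boolean runOneFlushStep() {
        Ufp ufp;
        synchronized (this) {
            ufp = ufpQueue.poll();
        }
        if (ufp == null) {
            return false;
        }
        Runnable task = ufp.flushTasks.poll();
        if (task != null) {
            task.run();
        }
        ufp.managedByFlushManager = false;
        registerUfp(ufp); // this is what prevents the second task being lost
        return true;
    }
}
```

With this sketch, a task that arrives while the UFP is managed is indeed
rejected at submission time, but the resubmission in `runOneFlushStep`
picks it up on the next round, which is exactly the guarantee debated
above.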
>
> Best,
> -----------------------------------
> Xiangdong Huang
> School of Software, Tsinghua University
>
> 黄向东
> 清华大学 软件学院
>
>
> Tianan Li <[email protected]> wrote on Fri, Jun 28, 2019 at 11:29 AM:
>
>> Hi,
>> I'm also involved in the design of the new storage engine. Let me add
>> some details on the new design of the flush task.
>>
>> To improve system performance, we changed flush tasks from synchronous
>> to asynchronous. We introduced a Flush Manager to manage all flush
>> tasks. The tricky problem is that each Unsealed TsFile Processor (UFP)
>> corresponds to a data file on disk, so the flush tasks in each UFP
>> (there is a flushing queue in each UFP) need to be executed
>> sequentially. However, flush tasks in different UFPs have no ordering
>> requirements. How do we design for these requirements?
>>
>> We introduce a UFP FIFO queue in Flush Manager, and add a boolean
>> attribute 'managedByFlushManager' to each UFP to indicate whether it is
>> currently managed by Flush Manager. Flush Manager maintains a flush
>> thread pool to perform flush tasks, so the lifecycle of a flush task is:
>> 1. A UFP is submitted to FlushManager; FlushManager adds the UFP to its
>> queue and sets 'managedByFlushManager' of the UFP to true.
>> 2. The flush pool in FlushManager starts a flush thread to execute the
>> task.
>> 3. The flush thread polls the first UFP from the UFP queue in
>> FlushManager -> polls the first flush task in the UFP -> completes the
>> flush task -> sets 'managedByFlushManager' of the UFP to false.
>>
>> There are two ways to submit a UFP to FlushManager:
>> 1. Whenever a MemTable reaches a certain size or a flush task is
>> forcibly triggered, the UFP submits itself to Flush Manager (because
>> the queue in Flush Manager holds UFPs). Flush Manager first determines
>> whether the UFP meets the criteria for submission:
>> (1) The UFP is not managed by Flush Manager, i.e.
>> 'managedByFlushManager' is false.
>> (2) The flush task queue in the UFP is not empty, i.e. there is at
>> least one flush task to be executed.
>>
>> 2. When the flush thread completes a flush task, it sets
>> 'managedByFlushManager' to false and resubmits the UFP of the completed
>> flush task to the FlushManager.
>>
>> Through the above design, we ensure that at any given time, for each
>> UFP, Flush Manager manages it at most once and executes at most one of
>> its flush tasks, while there is no restriction between different UFPs.
>> At the same time, resubmitting the UFP after each task finishes in the
>> flush thread ensures that all tasks can be executed. Therefore, we
>> solve the above problem and the design meets the requirements of Flush
>> Manager.
>>
>> Best Regards,
>> -------------------------------------
>> Tianan Li
>> School of Software, Tsinghua University
>>
>> > -----Original Message-----
>> > From: "Jialin Qiao" <[email protected]>
>> > Sent: 2019-06-27 11:27:24 (Thursday)
>> > To: [email protected]
>> > Cc:
>> > Subject: Re: Avoid long-tail insertion
>> >
>> > Hi,
>> >
>> > The new storage engine is designed to have the following components:
>> >
>> > (1) MemTable: a memory structure, which stores all inserted data in
>> > memory.
>> >
>> > (2) MemtablePool: manages all memtables. All memtables are obtained
>> > from this pool. The total number of memtables in the system is fixed.
>> > Once the pool has no available memtables, the getMemtable() operation
>> > will wait or return directly.
>> >
>> > (3) UnsealedTsFileProcessor (UFP): a writer for one data file. It
>> > always has one working memtable that receives writes and a list
>> > (flushing list) of memtables waiting to be flushed. Once the working
>> > memtable reaches a threshold, it is moved to the flushing list and
>> > the working memtable is set to null. When a new write arrives, if the
>> > working memtable is null, the UFP calls getMemtable() of the
>> > MemtablePool to get one as the working memtable.
>> >
>> > (4) StorageGroupProcessor (SGP): each SGP is responsible for all
>> > writes and reads in one storage group.
>> > It always has one working UFP that receives writes and a list
>> > (closing list) of UFPs waiting to be closed. Once the file size of
>> > the working UFP reaches a threshold, the UFP is moved to the closing
>> > list and the working UFP is set to null. When a new write arrives, if
>> > the working UFP is null, a new UFP is created as the working UFP and
>> > receives the write.
>> >
>> > (5) StorageGroupManager (SGM): a manager of all SGPs in IoTDB. It is
>> > only responsible for routing read and write operations to the
>> > corresponding SGP according to the deviceId of the operation.
>> >
>> > (6) Flush thread: a flush thread polls a memtable from the flushing
>> > list in a UFP and flushes it to disk. After flushing, the memtable is
>> > returned to the MemtablePool.
>> >
>> > These are only the main components of the new storage engine. Some
>> > things may be missing. It would be great if someone could give some
>> > advice or additions.
>> >
>> > Best,
>> > --
>> > Jialin Qiao
>> > School of Software, Tsinghua University
>> >
>> > 乔嘉林
>> > 清华大学 软件学院
>> >
>> > > -----Original Message-----
>> > > From: "Jialin Qiao" <[email protected]>
>> > > Sent: 2019-06-24 20:24:05 (Monday)
>> > > To: [email protected]
>> > > Cc:
>> > > Subject: Re: Re: Re: Avoid long-tail insertion
>> > >
>> > > Yes, there are many changes. The branch I am working on is
>> > > feature_async_close_tsfile. Anyone interested is welcome to join
>> > > and discuss.
>> > >
>> > > Best,
>> > > --
>> > > Jialin Qiao
>> > > School of Software, Tsinghua University
>> > >
>> > > 乔嘉林
>> > > 清华大学 软件学院
>> > >
>> > > > -----Original Message-----
>> > > > From: "Xiangdong Huang" <[email protected]>
>> > > > Sent: 2019-06-23 10:59:29 (Sunday)
>> > > > To: [email protected]
>> > > > Cc:
>> > > > Subject: Re: Re: Avoid long-tail insertion
>> > > >
>> > > > Hi,
>> > > >
>> > > > Once your work branch is almost ready, let me know so I can help
>> > > > review. I think it is a HUGE PR...
>> > > >
>> > > > -----------------------------------
>> > > > Xiangdong Huang
>> > > > School of Software, Tsinghua University
>> > > >
>> > > > 黄向东
>> > > > 清华大学 软件学院
>> > > >
>> > > >
>> > > > Jialin Qiao <[email protected]> wrote on Sat, Jun 22, 2019 at
>> > > > 9:57 PM:
>> > > >
>> > > > > Hi Xiangdong,
>> > > > >
>> > > > > I will merge this patch. Letting "Directories" manage the
>> > > > > folders of both sequence and unSequence files is good.
>> > > > >
>> > > > > However, the naming of "Directories" is not clear. It would be
>> > > > > better to rename it to "DirectoryManager".
>> > > > >
>> > > > > Best,
>> > > > > --
>> > > > > Jialin Qiao
>> > > > > School of Software, Tsinghua University
>> > > > >
>> > > > > 乔嘉林
>> > > > > 清华大学 软件学院
>> > > > >
>> > > > > > -----Original Message-----
>> > > > > > From: "Xiangdong Huang" <[email protected]>
>> > > > > > Sent: 2019-06-22 16:35:29 (Saturday)
>> > > > > > To: [email protected]
>> > > > > > Cc:
>> > > > > > Subject: Re: Avoid long-tail insertion
>> > > > > >
>> > > > > > Hi Jialin,
>> > > > > >
>> > > > > > I submitted some modifications to:
>> > > > > >
>> > > > > > * add the overflow data folder location setting in
>> > > > > > iotdb-engine.properties;
>> > > > > > * let Directories.java manage the above folder.
>> > > > > >
>> > > > > > If you need to refactor the overflow module when solving the
>> > > > > > long-tail issue, you can apply the patch from [1] first to
>> > > > > > simplify your work.
>> > > > > >
>> > > > > > [1]
>> > > > > > https://issues.apache.org/jira/secure/attachment/12972547/overflow-folder.patch
>> > > > > >
>> > > > > > Best,
>> > > > > > -----------------------------------
>> > > > > > Xiangdong Huang
>> > > > > > School of Software, Tsinghua University
>> > > > > >
>> > > > > > 黄向东
>> > > > > > 清华大学 软件学院
>> > > > > >
>> > > > > >
>> > > > > > Xiangdong Huang <[email protected]> wrote on Sat, Jun 22,
>> > > > > > 2019 at 3:19 PM:
>> > > > > >
>> > > > > > > If you change the process like this, i.e., there is more
>> > > > > > > than one unsealed TsFile for each storage group, then you
>> > > > > > > have to modify the WAL module, because the current WAL
>> > > > > > > module only recognizes the last unsealed TsFile.
>> > > > > > >
>> > > > > > > By the way, I think "sealed" is better than "closed". A
>> > > > > > > sealed file means a file that has the magic string at both
>> > > > > > > the head and the tail.
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > -----------------------------------
>> > > > > > > Xiangdong Huang
>> > > > > > > School of Software, Tsinghua University
>> > > > > > >
>> > > > > > > 黄向东
>> > > > > > > 清华大学 软件学院
>> > > > > > >
>> > > > > > >
>> > > > > > > Jialin Qiao <[email protected]> wrote on Sat, Jun 22,
>> > > > > > > 2019 at 2:54 PM:
>> > > > > > >
>> > > > > > >> Hi, I am solving the long-tail latency problem.
>> > > > > > >>
>> > > > > > >> There are some cases (blocking points) that block
>> > > > > > >> insertion. For a better understanding of this problem, I
>> > > > > > >> first introduce the write process of IoTDB:
>> > > > > > >>
>> > > > > > >> IoTDB maintains several independent engines (storage
>> > > > > > >> groups) that support reads and writes. In the following,
>> > > > > > >> we focus on one engine. An engine maintains several closed
>> > > > > > >> data files and one unclosed data file that receives
>> > > > > > >> appended data.
>> > > > > > >> In memory, there is only one working memtable (m1) that
>> > > > > > >> receives writes. There is also another memtable (m2) that
>> > > > > > >> will take the place of m1 when m1 is full and being
>> > > > > > >> flushed.
>> > > > > > >>
>> > > > > > >> When a data item is inserted:
>> > > > > > >>
>> > > > > > >> (1) We insert it into the working memtable.
>> > > > > > >> (2) We check the size of the memtable. If it reaches a
>> > > > > > >> threshold, we submit a flush task "after the previous
>> > > > > > >> flush task is finished" and switch the two memtables.
>> > > > > > >> (3) We check the size of the unclosed file. If it reaches
>> > > > > > >> a threshold, we close it "after the previous flush task is
>> > > > > > >> finished".
>> > > > > > >>
>> > > > > > >> In the above steps, every "after the previous flush task
>> > > > > > >> is finished" blocks the insertion process. One solution is
>> > > > > > >> to make all flush and close tasks asynchronous. Some
>> > > > > > >> questions need to be carefully considered:
>> > > > > > >>
>> > > > > > >> (1) Many memtables may be flushed concurrently to an
>> > > > > > >> unclosed file. How do we guarantee the order of
>> > > > > > >> serialization?
>> > > > > > >> (2) Once a close task is submitted, a new unclosed file
>> > > > > > >> will be created to receive appended data. So there will
>> > > > > > >> exist many unclosed files. How will the query and
>> > > > > > >> compaction processes be impacted?
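The memtable switch in steps (1)-(2) above, with the blocking wait
removed, can be sketched in Java as follows. This is a toy illustration
with invented names (`UfpSketch`, `flushedMemtables`), not IoTDB's code;
a `Long` list stands in for a memtable, and a single-threaded pool stands
in for the flush mechanism. Using one flush thread per file is also one
simple way to keep flushes to the same file sequential, which touches on
question (1):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Toy sketch (names assumed): insertion never waits for a flush.
class UfpSketch {
    static final int MEMTABLE_THRESHOLD = 3;

    private List<Long> workingMemtable = new ArrayList<>();
    final List<List<Long>> flushedMemtables = new ArrayList<>();
    // A single-threaded pool: flushes to this file stay sequential.
    private final ExecutorService flushPool = Executors.newSingleThreadExecutor();

    synchronized void insert(long value) {
        workingMemtable.add(value);                    // (1) write into working memtable
        if (workingMemtable.size() >= MEMTABLE_THRESHOLD) {
            List<Long> full = workingMemtable;         // (2) threshold reached:
            workingMemtable = new ArrayList<>();       //     switch memtables at once...
            flushPool.submit(() -> {                   //     ...and flush in the background
                synchronized (flushedMemtables) {
                    flushedMemtables.add(full);
                }
            });
        }
    }

    void shutdown() {
        flushPool.shutdown();
        try {
            flushPool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The key point of the sketch is that the writer only swaps a reference and
enqueues a task; it never waits for the previous flush to finish.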
>> > > > > > >>
>> > > > > > >> Thanks,
>> > > > > > >>
>> > > > > > >> Jialin Qiao
>> > > > > > >> School of Software, Tsinghua University
>> > > > > > >>
>> > > > > > >> 乔嘉林
>> > > > > > >> 清华大学 软件学院
>> > > > > > >>
>> > > > > > >> > -----Original Message-----
>> > > > > > >> > From: "Xiangdong Huang" <[email protected]>
>> > > > > > >> > Sent: 2019-06-04 23:08:34 (Tuesday)
>> > > > > > >> > To: [email protected], "江天" <[email protected]>
>> > > > > > >> > Cc:
>> > > > > > >> > Subject: Re: [jira] [Created] (IOTDB-112) Avoid long
>> > > > > > >> > tail insertion which is caused by synchronized
>> > > > > > >> > close-bufferwrite
>> > > > > > >> >
>> > > > > > >> > I attached the histogram of the latency in the JIRA.
>> > > > > > >> >
>> > > > > > >> > The x-axis is the latency while the y-axis is the
>> > > > > > >> > cumulative distribution. We can see that about 30% of
>> > > > > > >> > insertions finish within 20 ms, and 60% within 40 ms,
>> > > > > > >> > even though the IoTDB instance is serving a heavy
>> > > > > > >> > workload... So, eliminating the long-tail insertions can
>> > > > > > >> > make the average latency far better.
>> > > > > > >> >
>> > > > > > >> > If someone is working on refactor_overflow or
>> > > > > > >> > refactor_bufferwrite, please pay attention to the code
>> > > > > > >> > branch for this issue.
>> > > > > > >> >
>> > > > > > >> > Best,
>> > > > > > >> >
>> > > > > > >> > -----------------------------------
>> > > > > > >> > Xiangdong Huang
>> > > > > > >> > School of Software, Tsinghua University
>> > > > > > >> >
>> > > > > > >> > 黄向东
>> > > > > > >> > 清华大学 软件学院
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > xiangdong Huang (JIRA) <[email protected]> wrote on
>> > > > > > >> > Tue, Jun 4, 2019 at 11:00 PM:
>> > > > > > >> >
>> > > > > > >> > > xiangdong Huang created IOTDB-112:
>> > > > > > >> > > -------------------------------------
>> > > > > > >> > >
>> > > > > > >> > > Summary: Avoid long tail insertion which is caused by
>> > > > > > >> > > synchronized close-bufferwrite
>> > > > > > >> > > Key: IOTDB-112
>> > > > > > >> > > URL: https://issues.apache.org/jira/browse/IOTDB-112
>> > > > > > >> > > Project: Apache IoTDB
>> > > > > > >> > > Issue Type: Improvement
>> > > > > > >> > > Reporter: xiangdong Huang
>> > > > > > >> > >
>> > > > > > >> > > In our test, IoTDB has good insertion performance, and
>> > > > > > >> > > the average latency can be ~200 ms for a given
>> > > > > > >> > > workload and hardware.
>> > > > > > >> > >
>> > > > > > >> > > However, when we draw the histogram of the latency, we
>> > > > > > >> > > find that 97.5% of latencies are less than 200 ms,
>> > > > > > >> > > while 2.7% are greater. The result shows that there is
>> > > > > > >> > > some long-tail latency.
>> > > > > > >> > >
>> > > > > > >> > > Then we find that some insertion latencies are about
>> > > > > > >> > > 30 seconds... (but the ratio is less than 0.5%).
>> > > > > > >> > > Indeed, for each connection, a long-tail insertion
>> > > > > > >> > > appears every 1 or 2 minutes...
>> > > > > > >> > >
>> > > > > > >> > > By reading the source code, I think it is because, in
>> > > > > > >> > > the insertion function
>> > > > > > >> > >
>> > > > > > >> > > `private void insertBufferWrite(FileNodeProcessor
>> > > > > > >> > > fileNodeProcessor, long timestamp, boolean isMonitor,
>> > > > > > >> > > TSRecord tsRecord, String deviceId)`,
>> > > > > > >> > >
>> > > > > > >> > > if the corresponding TsFile is too large, the function
>> > > > > > >> > > is blocked until the memtable is flushed to disk and
>> > > > > > >> > > the TsFile is sealed (we call this closing a TsFile).
>> > > > > > >> > > The latencies of the long-tail insertions are very
>> > > > > > >> > > close to the time cost of flushing and sealing a
>> > > > > > >> > > TsFile.
>> > > > > > >> > >
>> > > > > > >> > > So, if we make the closing function asynchronous, we
>> > > > > > >> > > can avoid the long-tail insertion.
>> > > > > > >> > >
>> > > > > > >> > > However, there are some side effects we have to fix:
>> > > > > > >> > > # At the same time, if a new insertion comes, then a
>> > > > > > >> > > new memtable should be assigned, and a new unsealed
>> > > > > > >> > > TsFile is created;
>> > > > > > >> > > # That means there is more than one unsealed TsFile if
>> > > > > > >> > > the system crashes before the closing function
>> > > > > > >> > > finishes. So we have to modify the startup process to
>> > > > > > >> > > recover these files.
>> > > > > > >> > >
>> > > > > > >> > > Is there any other side effect that I have to pay
>> > > > > > >> > > attention to?
>> > > > > > >> > >
>> > > > > > >> > > --
>> > > > > > >> > > This message was sent by Atlassian JIRA
>> > > > > > >> > > (v7.6.3#76005)
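The async-close idea proposed in IOTDB-112 at the root of this thread can
be sketched as follows. This is a toy model with invented names
(`StorageGroupSketch`, `closingFiles`), not the actual IoTDB code: a
string stands in for a TsFile, and wrapping it in a magic marker stands
in for sealing (the magic string at head and tail mentioned earlier in
the thread). The writer never blocks on sealing; the full file moves to a
closing list and a fresh unsealed file takes over immediately:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Toy sketch of asynchronous close (names assumed, not IoTDB's code).
class StorageGroupSketch {
    static final int FILE_THRESHOLD = 4;

    private StringBuilder workingFile = new StringBuilder();
    final Queue<CompletableFuture<String>> closingFiles = new ArrayDeque<>();

    synchronized void append(String record) {
        workingFile.append(record);
        if (workingFile.length() >= FILE_THRESHOLD) {
            final String full = workingFile.toString();
            workingFile = new StringBuilder(); // a new unsealed file takes over at once
            // Sealing (magic string at head and tail) runs asynchronously;
            // the writer does not wait for it.
            closingFiles.add(
                CompletableFuture.supplyAsync(() -> "MAGIC" + full + "MAGIC"));
        }
    }
}
```

As the thread notes, this means several unsealed files can coexist (e.g.
after a crash before sealing finishes), which is exactly why the WAL and
recovery paths need to handle more than one unsealed TsFile.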
