Hi, I have reviewed the code. The new structure is far better than the previous version.
One question: it seems that the merge process and the data sync module are not finished in your branch. If we merge this PR into the master branch, we need to finish these two modules ASAP.

Best,
-----------------------------------
Xiangdong Huang
School of Software, Tsinghua University

Jialin Qiao <[email protected]> wrote on Tue, Jul 2, 2019 at 8:38 AM:

> Hi,
>
> After several days' hard work, the new storage engine and the corresponding
> query engine are finally stable and behave well...
>
> I have submitted a PR [1]. The new engine reduces the long-tail latency
> from 50+ s to under 5 s. Detailed performance evaluation results will be
> presented later.
>
> In this version, there are several new modules and strategies:
>
> Modules:
>
> (1) TVListPool: A cache of TVLists, designed for reusing TVLists and
> reducing GC pressure.
> (2) PrimitiveArrayPool: A cache of arrays of primitive data types in Java,
> such as int[], long[], double[].
> (3) TsFileResource: Each data file is denoted by a TsFileResource, which
> is used for queries.
>
> Strategies:
>
> (1) Asynchronously flushing memtables to disk.
> (2) Asynchronously appending metadata and closing data files.
>
> [1] https://github.com/apache/incubator-iotdb/pull/217
>
> Best,
> --
> Jialin Qiao
> School of Software, Tsinghua University
>
> > -----Original Message-----
> > From: "Xiangdong Huang" <[email protected]>
> > Sent: 2019-06-28 20:45:44 (Friday)
> > To: [email protected]
> > Cc:
> > Subject: Re: Re: Avoid long-tail insertion
> >
> > I see. The following sentence is very important to guarantee
> > correctness...
> >
> > > At the same time, resubmitting the UFP after the end of the task in
> > > each Flush thread ensures that all tasks can be executed.
> >
> > I think it should be highlighted, rather than drowned in the last
> > paragraph.
> >
> > Anyway, now I understand what you want to do.
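The pooling idea behind TVListPool and PrimitiveArrayPool can be sketched roughly as follows (a minimal illustration with a hypothetical LongArrayPool class; this is not the actual IoTDB API): arrays are borrowed and returned instead of being reallocated, so steady-state flushes create little garbage.

```java
import java.util.ArrayDeque;

// Hypothetical sketch of a pool for one primitive array type. Class and
// method names are illustrative assumptions, not the real IoTDB code.
public class LongArrayPool {
    private final int arraySize;
    private final ArrayDeque<long[]> pool = new ArrayDeque<>();

    public LongArrayPool(int arraySize) {
        this.arraySize = arraySize;
    }

    // Return a cached array if available; allocate only on a pool miss.
    public synchronized long[] borrow() {
        long[] a = pool.poll();
        return (a != null) ? a : new long[arraySize];
    }

    // Give the array back for reuse instead of letting the GC collect it.
    public synchronized void release(long[] a) {
        if (a.length == arraySize) {
            pool.push(a);
        }
    }

    public synchronized int cachedCount() {
        return pool.size();
    }
}
```

A writer would borrow arrays while building a memtable and release them after the flush completes, which is the GC-reduction effect the pools above aim at.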
> >
> > Best,
> > -----------------------------------
> > Xiangdong Huang
> > School of Software, Tsinghua University
> >
> > Xiangdong Huang <[email protected]> wrote on Fri, Jun 28, 2019 at 8:41 PM:
> >
> > > Hi Tianan,
> > >
> > > > the flush tasks in each UFP (there is a flushing queue in each UFP)
> > > > need to be executed sequentially.
> > > > Flush thread polls the first UFP from the UFP queue in FlushManager ->
> > > > polls the first flush task in the UFP -> completes the flush task ->
> > > > sets 'managedByFlushManager' of the UFP to false.
> > >
> > > This indicates that there can be more than one flush task in a UFP, but
> > > FlushManager just takes one task from the UFP, then marks the UFP as
> > > managedByFlushManager=false and polls it out of the queue? (So, when are
> > > the remaining tasks flushed?)
> > >
> > > > (1) The UFP is not managed by Flush Manager, i.e.
> > > > 'managedByFlushManager' is false
> > >
> > > Suppose the UFP has one flush task and is enqueued. Then the UFP
> > > receives a second flush task. However, it cannot be added to the queue
> > > because managedByFlushManager == true. Using your logic above, you would
> > > ignore the second flush task....
> > >
> > > > Flush Manager first determines whether the UFP meets the criteria for
> > > > submission
> > >
> > > Do you mean both criteria must be satisfied? (If so, the hypothetical
> > > situation above will occur.)
> > >
> > > > Through the above design, we can ensure that at any given time, each
> > > > UFP is managed by Flush Manager at most once and executes at most one
> > > > flush task, while there is no restriction between different UFPs.
> > >
> > > Using your design, we can ensure that at a given time a UFP is managed
> > > by the Flush Manager at most once and executes at most one flush task,
> > > but how do we avoid the hypothetical situation above?
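The submit/resubmit handshake being discussed might look roughly like the sketch below (class and field names follow the thread's terminology, but the code itself is an illustrative assumption, not the actual branch). Resubmitting the UFP after every completed task is what rescues a task that arrived while the UFP was already managed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of the UFP/FlushManager design from the thread.
class UnsealedTsFileProcessor {
    final Queue<Runnable> flushTasks = new ArrayDeque<>();
    boolean managedByFlushManager = false;
}

class FlushManager {
    private final Queue<UnsealedTsFileProcessor> ufpQueue = new ArrayDeque<>();

    // Submission criteria: not already managed, and at least one pending task.
    synchronized void registerUFP(UnsealedTsFileProcessor ufp) {
        if (!ufp.managedByFlushManager && !ufp.flushTasks.isEmpty()) {
            ufp.managedByFlushManager = true;
            ufpQueue.add(ufp);
        }
    }

    // One iteration of a flush thread: take a UFP, run exactly one task,
    // then unmark and resubmit so remaining tasks are eventually executed.
    void runOneFlush() {
        UnsealedTsFileProcessor ufp;
        synchronized (this) {
            ufp = ufpQueue.poll();
        }
        if (ufp == null) return;
        Runnable task = ufp.flushTasks.poll();
        if (task != null) task.run();
        synchronized (this) {
            ufp.managedByFlushManager = false;
        }
        registerUFP(ufp); // key step: re-check for tasks queued meanwhile
    }
}
```

In this sketch the second flush task from the hypothetical situation is not lost: the resubmission after the first task re-enqueues the UFP because its task queue is still non-empty.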
> > >
> > > Best,
> > > -----------------------------------
> > > Xiangdong Huang
> > > School of Software, Tsinghua University
> > >
> > > Tianan Li <[email protected]> wrote on Fri, Jun 28, 2019 at 11:29 AM:
> > >
> > >> Hi,
> > >> I am also involved in the design of the new storage engine. Let me
> > >> explain the new design of the flush task.
> > >>
> > >> To improve system performance, we changed flush tasks from synchronous
> > >> to asynchronous. We introduced a Flush Manager to manage all flush
> > >> tasks. The tricky problem is that each Unsealed TsFile Processor (UFP)
> > >> corresponds to a data file on disk, so the flush tasks in each UFP
> > >> (there is a flushing queue in each UFP) need to be executed
> > >> sequentially. However, flush tasks in different UFPs have no ordering
> > >> requirements. How do we design this to meet the above requirements?
> > >>
> > >> We introduce a UFP FIFO queue in Flush Manager, and add a boolean
> > >> attribute 'managedByFlushManager' to each UFP to indicate whether it is
> > >> managed by Flush Manager. Flush Manager maintains a flush thread pool
> > >> to perform flush tasks, so the lifecycle of a flush task is:
> > >> 1. A UFP is submitted to FlushManager; FlushManager adds the UFP to its
> > >> queue and sets 'managedByFlushManager' of the UFP to true.
> > >> 2. The flush pool in FlushManager starts a flush thread to execute the
> > >> task.
> > >> 3. The flush thread polls the first UFP from the UFP queue in
> > >> FlushManager -> polls the first flush task in the UFP -> completes the
> > >> flush task -> sets 'managedByFlushManager' of the UFP to false.
> > >>
> > >> There are two ways to submit a UFP to FlushManager:
> > >> 1. Whenever a MemTable reaches a certain size or a flush task is
> > >> forcibly triggered, the UFP submits itself to Flush Manager (because
> > >> the queue in Flush Manager holds UFPs).
> > >> Flush Manager first determines whether the UFP meets the criteria for
> > >> submission:
> > >> (1) The UFP is not managed by Flush Manager, i.e.
> > >> 'managedByFlushManager' is false.
> > >> (2) The flush task queue in the UFP is not empty, that is, there is at
> > >> least one flush task to be executed.
> > >>
> > >> 2. When the flush thread completes a flush task, it sets
> > >> 'managedByFlushManager' to false and resubmits the UFP of the completed
> > >> flush task to the FlushManager.
> > >>
> > >> Through the above design, we can ensure that at any given time, each
> > >> UFP is managed by Flush Manager at most once and executes at most one
> > >> flush task, while there is no restriction between different UFPs. At
> > >> the same time, resubmitting the UFP after each task finishes in the
> > >> flush thread ensures that all tasks eventually get executed. Therefore,
> > >> we solve the above problem and the design meets the requirements of
> > >> Flush Manager.
> > >>
> > >> Best Regards,
> > >> -------------------------------------
> > >> Tianan Li
> > >> School of Software, Tsinghua University
> > >>
> > >> > -----Original Message-----
> > >> > From: "Jialin Qiao" <[email protected]>
> > >> > Sent: 2019-06-27 11:27:24 (Thursday)
> > >> > To: [email protected]
> > >> > Cc:
> > >> > Subject: Re: Avoid long-tail insertion
> > >> >
> > >> > Hi,
> > >> >
> > >> > The new storage engine is designed to have the following components:
> > >> >
> > >> > (1) MemTable: A memory structure that stores all inserted data in
> > >> > memory.
> > >> >
> > >> > (2) MemtablePool: Manages all memtables. All memtables are obtained
> > >> > from this pool. The total number of memtables in the system is fixed.
> > >> > Once the pool has no available memtables, the getMemtable() operation
> > >> > will either wait or return directly.
> > >> >
> > >> > (3) UnsealedTsFileProcessor (UFP): A writer for one data file.
> > >> > It always has one working memtable that receives writes and a list
> > >> > (the flushing list) of memtables waiting to be flushed. Once the
> > >> > working memtable reaches a threshold, it is moved to the flushing
> > >> > list and the working memtable is set to null. When a new write
> > >> > arrives, if the working memtable is null, the UFP calls getMemtable()
> > >> > of the MemtablePool to get one as the working memtable.
> > >> >
> > >> > (4) StorageGroupProcessor (SGP): Each SGP is responsible for all
> > >> > writes and reads in one storage group. It always has one working UFP
> > >> > that receives writes and a list (the closing list) of UFPs waiting to
> > >> > be closed. Once the file size of the working UFP reaches a threshold,
> > >> > the UFP is moved to the closing list and the working UFP is set to
> > >> > null. When a new write arrives, if the working UFP is null, a new UFP
> > >> > is created as the working UFP and receives the write.
> > >> >
> > >> > (5) StorageGroupManager (SGM): A manager of all SGPs in IoTDB. It is
> > >> > only responsible for routing read and write operations to the
> > >> > corresponding SGP according to the deviceId of the operation.
> > >> >
> > >> > (6) Flush thread: The flush thread polls a memtable from the flushing
> > >> > list of a UFP and flushes it to disk. After flushing, the memtable is
> > >> > returned to the MemtablePool.
> > >> >
> > >> > These are only the main components of the new storage engine. Some
> > >> > details may be missing. It would be great if someone could offer
> > >> > advice or additions.
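The fixed-capacity pool described in (2) could be sketched like this (an illustrative assumption, not the real implementation; the class names mirror the thread's terminology, and here getMemtable() blocks until a memtable is returned):

```java
import java.util.ArrayDeque;

// Minimal sketch of a bounded memtable pool; the MemTable class is an
// empty stand-in for a real memtable holding time-value lists.
class MemTable {
}

class MemTablePool {
    private final ArrayDeque<MemTable> available = new ArrayDeque<>();

    MemTablePool(int capacity) {
        // The total number of memtables in the system is fixed up front.
        for (int i = 0; i < capacity; i++) {
            available.add(new MemTable());
        }
    }

    synchronized MemTable getMemtable() {
        while (available.isEmpty()) {
            try {
                wait(); // block until a flush returns a memtable
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null; // caller may retry the write later
            }
        }
        return available.poll();
    }

    synchronized void putBack(MemTable memtable) {
        available.add(memtable);
        notifyAll(); // wake writers waiting for a memtable
    }
}
```

A flush thread would call putBack() after flushing, which is what unblocks a UFP waiting in getMemtable() for a new working memtable.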
> > >> >
> > >> > Best,
> > >> > --
> > >> > Jialin Qiao
> > >> > School of Software, Tsinghua University
> > >> >
> > >> > > -----Original Message-----
> > >> > > From: "Jialin Qiao" <[email protected]>
> > >> > > Sent: 2019-06-24 20:24:05 (Monday)
> > >> > > To: [email protected]
> > >> > > Cc:
> > >> > > Subject: Re: Re: Re: Avoid long-tail insertion
> > >> > >
> > >> > > Yes, there are many changes. The branch I am working on is
> > >> > > feature_async_close_tsfile.
> > >> > > Anyone interested is welcome to join and discuss.
> > >> > >
> > >> > > Best,
> > >> > > --
> > >> > > Jialin Qiao
> > >> > > School of Software, Tsinghua University
> > >> > >
> > >> > > > -----Original Message-----
> > >> > > > From: "Xiangdong Huang" <[email protected]>
> > >> > > > Sent: 2019-06-23 10:59:29 (Sunday)
> > >> > > > To: [email protected]
> > >> > > > Cc:
> > >> > > > Subject: Re: Re: Avoid long-tail insertion
> > >> > > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > Once your work branch is almost ready, let me know so I can help
> > >> > > > review. I think it is a HUGE PR...
> > >> > > >
> > >> > > > -----------------------------------
> > >> > > > Xiangdong Huang
> > >> > > > School of Software, Tsinghua University
> > >> > > >
> > >> > > > Jialin Qiao <[email protected]> wrote on Sat, Jun 22, 2019
> > >> > > > at 9:57 PM:
> > >> > > >
> > >> > > > > Hi Xiangdong,
> > >> > > > >
> > >> > > > > I will merge this patch. Letting "Directories" manage the
> > >> > > > > folders of both sequence and unsequence files is good.
> > >> > > > >
> > >> > > > > However, the naming of "Directories" is not clear.
It would be
> > >> > > > > better to rename it to "DirectoryManager".
> > >> > > > >
> > >> > > > > Best,
> > >> > > > > --
> > >> > > > > Jialin Qiao
> > >> > > > > School of Software, Tsinghua University
> > >> > > > >
> > >> > > > > > -----Original Message-----
> > >> > > > > > From: "Xiangdong Huang" <[email protected]>
> > >> > > > > > Sent: 2019-06-22 16:35:29 (Saturday)
> > >> > > > > > To: [email protected]
> > >> > > > > > Cc:
> > >> > > > > > Subject: Re: Avoid long-tail insertion
> > >> > > > > >
> > >> > > > > > Hi Jialin,
> > >> > > > > >
> > >> > > > > > I submitted some modifications:
> > >> > > > > >
> > >> > > > > > * add the overflow data folder location setting in
> > >> > > > > > iotdb-engine.properties;
> > >> > > > > > * let Directories.java manage the above folder.
> > >> > > > > >
> > >> > > > > > If you need to refactor the overflow while solving the
> > >> > > > > > long-tail issue, you can apply the patch from [1] first to
> > >> > > > > > simplify your work.
> > >> > > > > >
> > >> > > > > > [1]
> > >> https://issues.apache.org/jira/secure/attachment/12972547/overflow-folder.patch
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > -----------------------------------
> > >> > > > > > Xiangdong Huang
> > >> > > > > > School of Software, Tsinghua University
> > >> > > > > >
> > >> > > > > > Xiangdong Huang <[email protected]> wrote on Sat, Jun 22,
> > >> > > > > > 2019 at 3:19 PM:
> > >> > > > > >
> > >> > > > > > > If you change the process like this, i.e., there is more
> > >> > > > > > > than one unsealed TsFile for each storage group, then you
> > >> > > > > > > have to modify the WAL module, because the current WAL
> > >> > > > > > > module only recognizes the last unsealed TsFile.
> > >> > > > > > >
> > >> > > > > > > By the way, I think "sealed" is better than "closed". A
> > >> > > > > > > sealed file is one that has the magic string at both the
> > >> > > > > > > head and the tail.
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > -----------------------------------
> > >> > > > > > > Xiangdong Huang
> > >> > > > > > > School of Software, Tsinghua University
> > >> > > > > > >
> > >> > > > > > > Jialin Qiao <[email protected]> wrote on Sat, Jun 22,
> > >> > > > > > > 2019 at 2:54 PM:
> > >> > > > > > >
> > >> > > > > > >> Hi, I am solving the long-tail latency problem.
> > >> > > > > > >>
> > >> > > > > > >> There are some cases (blocking points) that block
> > >> > > > > > >> insertion. For a better understanding of this problem, I
> > >> > > > > > >> first introduce the writing process of IoTDB:
> > >> > > > > > >>
> > >> > > > > > >> IoTDB maintains several independent engines (storage
> > >> > > > > > >> groups) that support reads and writes. In the following,
> > >> > > > > > >> we focus on one engine. An engine maintains several closed
> > >> > > > > > >> data files and one unclosed data file that receives
> > >> > > > > > >> appended data. In memory, there is only one working
> > >> > > > > > >> memtable (m1) that receives writes. There is also another
> > >> > > > > > >> memtable (m2) that will replace m1 when m1 is full and
> > >> > > > > > >> being flushed.
> > >> > > > > > >>
> > >> > > > > > >> When a data item is inserted:
> > >> > > > > > >>
> > >> > > > > > >> (1) We insert it into the working memtable.
> > >> > > > > > >> (2) We check the size of the memtable.
> > >> > > > > > >> If it reaches a threshold, we submit a flush task "after
> > >> > > > > > >> the previous flush task is finished" and switch the two
> > >> > > > > > >> memtables.
> > >> > > > > > >> (3) We check the size of the unclosed file. If it reaches
> > >> > > > > > >> a threshold, we close it "after the previous flush task is
> > >> > > > > > >> finished".
> > >> > > > > > >>
> > >> > > > > > >> In the above steps, all the "after the previous flush task
> > >> > > > > > >> is finished" waits block the insertion process. One
> > >> > > > > > >> solution is to make all flush and close tasks
> > >> > > > > > >> asynchronous. Some questions need to be carefully
> > >> > > > > > >> considered:
> > >> > > > > > >>
> > >> > > > > > >> (1) Many memtables may be flushed concurrently to an
> > >> > > > > > >> unclosed file. How do we guarantee the order of
> > >> > > > > > >> serialization?
> > >> > > > > > >> (2) Once a close task is submitted, a new unclosed file is
> > >> > > > > > >> created to receive appended data, so there may exist many
> > >> > > > > > >> unclosed files. How will the query and compaction
> > >> > > > > > >> processes be impacted?
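The non-blocking version of steps (1) and (2) can be sketched as follows (all names are illustrative assumptions, not the actual code): instead of waiting for the previous flush, the full memtable is moved to a flushing list for a background thread, and the insert path never blocks.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of an asynchronous insert path. The memtable is modeled as a
// queue of timestamps; thresholds and names are illustrative.
class AsyncInsertSketch {
    static final int MEMTABLE_THRESHOLD = 3;

    Queue<Long> workingMemtable = new ArrayDeque<>();
    final Queue<Queue<Long>> flushingList = new ArrayDeque<>();
    int submittedFlushTasks = 0;

    void insert(long timestamp) {
        workingMemtable.add(timestamp);        // (1) write to working memtable
        if (workingMemtable.size() >= MEMTABLE_THRESHOLD) {
            flushingList.add(workingMemtable); // (2) swap instead of waiting
            workingMemtable = new ArrayDeque<>();
            submittedFlushTasks++;             // a flush task is queued for a
                                               // background flush thread
        }
        // (3) closing the unclosed file would likewise be submitted
        // asynchronously rather than awaited here
    }
}
```

Note that question (1) above still applies to this sketch: the flushing list preserves FIFO order per file, which is one way to keep serialization ordered when several memtables are pending for the same unclosed file.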
> > >> > > > > > >>
> > >> > > > > > >> Thanks,
> > >> > > > > > >> Jialin Qiao
> > >> > > > > > >> School of Software, Tsinghua University
> > >> > > > > > >>
> > >> > > > > > >> > -----Original Message-----
> > >> > > > > > >> > From: "Xiangdong Huang" <[email protected]>
> > >> > > > > > >> > Sent: 2019-06-04 23:08:34 (Tuesday)
> > >> > > > > > >> > To: [email protected], "Tian Jiang" <[email protected]>
> > >> > > > > > >> > Cc:
> > >> > > > > > >> > Subject: Re: [jira] [Created] (IOTDB-112) Avoid long
> > >> > > > > > >> > tail insertion which is caused by synchronized
> > >> > > > > > >> > close-bufferwrite
> > >> > > > > > >> >
> > >> > > > > > >> > I attached the histogram of the latency in the JIRA
> > >> > > > > > >> > issue.
> > >> > > > > > >> >
> > >> > > > > > >> > The x-axis is the latency while the y-axis is the
> > >> > > > > > >> > cumulative distribution. We can see that about 30% of
> > >> > > > > > >> > insertions finish within 20 ms, and 60% within 40 ms,
> > >> > > > > > >> > even though the IoTDB instance is serving a heavy
> > >> > > > > > >> > workload... So, eliminating the long-tail insertions can
> > >> > > > > > >> > make the average latency far better.
> > >> > > > > > >> >
> > >> > > > > > >> > If someone is working on refactor_overflow or
> > >> > > > > > >> > refactor_bufferwrite, please pay attention to the code
> > >> > > > > > >> > branch for this issue.
> > >> > > > > > >> >
> > >> > > > > > >> > Best,
> > >> > > > > > >> > -----------------------------------
> > >> > > > > > >> > Xiangdong Huang
> > >> > > > > > >> > School of Software, Tsinghua University
> > >> > > > > > >> >
> > >> > > > > > >> > xiangdong Huang (JIRA) <[email protected]> wrote on Tue,
> > >> > > > > > >> > Jun 4, 2019 at 11:00 PM:
> > >> > > > > > >> >
> > >> > > > > > >> > > xiangdong Huang created IOTDB-112:
> > >> > > > > > >> > > -------------------------------------
> > >> > > > > > >> > >
> > >> > > > > > >> > >     Summary: Avoid long tail insertion which is caused
> > >> > > > > > >> > >              by synchronized close-bufferwrite
> > >> > > > > > >> > >         Key: IOTDB-112
> > >> > > > > > >> > >         URL: https://issues.apache.org/jira/browse/IOTDB-112
> > >> > > > > > >> > >     Project: Apache IoTDB
> > >> > > > > > >> > >  Issue Type: Improvement
> > >> > > > > > >> > >    Reporter: xiangdong Huang
> > >> > > > > > >> > >
> > >> > > > > > >> > > In our test, IoTDB has good insertion performance, and
> > >> > > > > > >> > > the average latency can be ~200 ms for a given
> > >> > > > > > >> > > workload and hardware.
> > >> > > > > > >> > >
> > >> > > > > > >> > > However, when we draw the histogram of the latency, we
> > >> > > > > > >> > > find that 97.5% of latencies are less than 200 ms,
> > >> > > > > > >> > > while 2.7% are greater. The result shows that there is
> > >> > > > > > >> > > some long-tail latency.
> > >> > > > > > >> > >
> > >> > > > > > >> > > Then we found that some insertion latencies are about
> > >> > > > > > >> > > 30 seconds... (but the ratio is less than 0.5%).
> > >> > > > > > >> > > Indeed, for each connection, a long-tail insertion
> > >> > > > > > >> > > appears every 1 or 2 minutes....
> > >> > > > > > >> > >
> > >> > > > > > >> > > By reading the source code, I think it is because, in
> > >> > > > > > >> > > the insertion function
> > >> > > > > > >> > >
> > >> > > > > > >> > > `private void insertBufferWrite(FileNodeProcessor
> > >> > > > > > >> > > fileNodeProcessor, long timestamp, boolean isMonitor,
> > >> > > > > > >> > > TSRecord tsRecord, String deviceId)`,
> > >> > > > > > >> > >
> > >> > > > > > >> > > if the corresponding TsFile is too large, the function
> > >> > > > > > >> > > is blocked until the memtable is flushed to disk and
> > >> > > > > > >> > > the TsFile is sealed (we call this closing a TsFile).
> > >> > > > > > >> > > The latencies of the long-tail insertions are very
> > >> > > > > > >> > > close to the time cost of flushing and sealing a
> > >> > > > > > >> > > TsFile.
> > >> > > > > > >> > >
> > >> > > > > > >> > > So, if we make the closing function asynchronous, we
> > >> > > > > > >> > > can avoid the long-tail insertions.
> > >> > > > > > >> > >
> > >> > > > > > >> > > However, there are some side effects we have to fix:
> > >> > > > > > >> > > # If a new insertion comes at the same time, a new
> > >> > > > > > >> > > memtable should be assigned, and a new unsealed TsFile
> > >> > > > > > >> > > is created;
> > >> > > > > > >> > > # That means there can be more than one unsealed
> > >> > > > > > >> > > TsFile if the system crashes before the closing
> > >> > > > > > >> > > function finishes. So we have to modify the startup
> > >> > > > > > >> > > process to recover these files.
> > >> > > > > > >> > >
> > >> > > > > > >> > > Is there any other side effect that I have to pay
> > >> > > > > > >> > > attention to?
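The "sealed" notion discussed in the thread (a magic string at both the head and the tail of a file) suggests how a modified startup process could detect unsealed files to recover. A rough sketch over an in-memory byte array (the magic value is illustrative, not the real TsFile format constant):

```java
import java.nio.charset.StandardCharsets;

// Sketch of a sealed-file check: a file is considered sealed only if the
// magic bytes appear at both ends. "TsFile" as the magic value is an
// assumption for illustration.
class SealCheck {
    static final byte[] MAGIC = "TsFile".getBytes(StandardCharsets.US_ASCII);

    static boolean isSealed(byte[] fileBytes) {
        if (fileBytes.length < 2 * MAGIC.length) return false;
        for (int i = 0; i < MAGIC.length; i++) {
            // check head magic and tail magic in one pass
            if (fileBytes[i] != MAGIC[i]) return false;
            if (fileBytes[fileBytes.length - MAGIC.length + i] != MAGIC[i]) return false;
        }
        return true;
    }
}
```

On startup, recovery would scan each storage group's data files with a check like this and replay or re-seal any file whose tail magic is missing.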
> > >> > > > > > >> > >
> > >> > > > > > >> > > --
> > >> > > > > > >> > > This message was sent by Atlassian JIRA
> > >> > > > > > >> > > (v7.6.3#76005)
