Sicheng Yu created IOTDB-6267:
---------------------------------

             Summary: Load 2.0
                 Key: IOTDB-6267
                 URL: https://issues.apache.org/jira/browse/IOTDB-6267
             Project: Apache IoTDB
          Issue Type: Improvement
            Reporter: Sicheng Yu
            Assignee: Sicheng Yu


- MPP Load has some problems with stability and compatibility with the Pipe 
system, and there is room for optimization in loading speed.
Issue 1: No upper limit on the total memory used when multiple Load 
statements are executed concurrently.
- Currently, Load only strictly controls the upper limit of memory used by a 
single Load statement during its execution life cycle.
- When a large number of Load statements are executed concurrently, the total 
memory used by these Load statements is unbounded.
- Please refer to MPP Load memory footprint for the memory usage during the 
execution life cycle of a single Load statement.
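
One possible direction for Issue 1, as a minimal sketch only: a process-wide 
budget that every Load statement must reserve from before allocating, so the 
sum across concurrent statements stays under one cap. The class name 
LoadMemoryBudget and its methods are hypothetical and are not existing IoTDB 
APIs; how a rejected Load should wait or fail is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical process-wide memory budget shared by all concurrent Load statements. */
final class LoadMemoryBudget {
  private final long capacityInBytes;
  private final AtomicLong usedInBytes = new AtomicLong(0);

  LoadMemoryBudget(long capacityInBytes) {
    if (capacityInBytes <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacityInBytes = capacityInBytes;
  }

  /** Reserves the bytes and returns true, or reserves nothing and returns false. */
  boolean tryReserve(long bytes) {
    if (bytes < 0) {
      throw new IllegalArgumentException("bytes must be non-negative");
    }
    while (true) {
      long current = usedInBytes.get();
      if (bytes > capacityInBytes - current) {
        return false; // granting this request would exceed the global cap
      }
      if (usedInBytes.compareAndSet(current, current + bytes)) {
        return true;
      }
      // Another thread changed the counter in between; retry with the new value.
    }
  }

  /** Returns previously reserved bytes to the budget. */
  void release(long bytes) {
    if (bytes < 0) {
      throw new IllegalArgumentException("bytes must be non-negative");
    }
    while (true) {
      long current = usedInBytes.get();
      if (bytes > current) {
        throw new IllegalStateException("releasing more than was reserved");
      }
      if (usedInBytes.compareAndSet(current, current - bytes)) {
        return;
      }
    }
  }

  long used() {
    return usedInBytes.get();
  }
}
```

In this sketch a Load statement would call tryReserve before buffering data 
and release in a finally block, so the per-statement limit that exists today 
could remain in place underneath the global one.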
Issue 2: New data added by Load is not properly recognized by the Pipe system.
- The Pipe system currently attaches a ProgressIndex to all new data written 
to IoTDB (see the discussion of Key Issues in Pipe System Design and 
Implementation).
- In the normal write process, the progress index is attached by the 
consensus layer.
- However, Load's current two-phase commit process does not go through the 
consensus layer, so loaded data has no normal progress index and cannot be 
correctly recognized by the Pipe system when the task restarts.
Issue 3: Too many serialization steps in the Load TsFile process.
- In the LoadTsFileScheduler class, MPP Load 1.0 works as follows:
  - Iterate through the TsFiles to be loaded.
  - For each TsFile, split it first, then send the pieces, and run the second 
phase only after all sends have completed.
  - After the second phase completes, load the next TsFile; TsFiles are 
processed strictly one after another.
- Since TsFile splitting may involve in-memory computation, disk IO capacity 
is not fully utilized while that computation runs.
- Within a single TsFile, splitting and sending via Thrift run serially, so 
the process waits on disk IO and network IO in turn.
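
To illustrate the kind of overlap Issue 3 is asking for, here is a minimal 
sketch in which splitting and sending run concurrently through a bounded 
queue, so network IO can proceed while the next piece is still being split. 
It is not the LoadTsFileScheduler implementation: SplitSendPipeline, the 
splitter and sender callbacks, and the use of strings as stand-ins for TsFile 
pieces are all assumptions made for illustration, and the second phase is not 
modeled.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical pipeline: one thread splits TsFiles while the caller's thread sends pieces. */
final class SplitSendPipeline {

  /** Splits and sends all files; returns the number of pieces sent. */
  static int run(
      List<String> tsFiles,
      Function<String, List<String>> splitter, // stand-in for disk-bound splitting
      Consumer<String> sender, // stand-in for the network-bound Thrift send
      int queueCapacity)
      throws InterruptedException, ExecutionException {
    // The bounded queue caps how many split pieces are held in memory at once.
    BlockingQueue<String> pieces = new ArrayBlockingQueue<>(queueCapacity);
    ExecutorService splitExecutor = Executors.newSingleThreadExecutor();
    try {
      Future<?> splitting =
          splitExecutor.submit(
              () -> {
                for (String tsFile : tsFiles) {
                  for (String piece : splitter.apply(tsFile)) {
                    pieces.put(piece); // blocks while the sender is behind
                  }
                }
                return null;
              });

      int sent = 0;
      while (true) {
        String piece = pieces.poll(10, TimeUnit.MILLISECONDS);
        if (piece != null) {
          sender.accept(piece);
          sent++;
        } else if (splitting.isDone() && pieces.isEmpty()) {
          break; // the splitter has finished and every piece has been sent
        }
      }
      splitting.get(); // rethrows any failure raised while splitting
      return sent;
    } finally {
      splitExecutor.shutdownNow(); // unblocks the splitter if sending failed early
    }
  }
}
```

With one splitting thread and one sending thread the pieces still arrive in 
order, which keeps the sketch easy to reason about; whether Load 2.0 should 
also pipeline across TsFiles or parallelize sends is a separate design choice.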

Document link: https://apache-iotdb.feishu.cn/docx/UE9Od5caDoLoYJxt4Ptc4s0hnof



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
