kaori-seasons commented on PR #1923:
URL: https://github.com/apache/fluss/pull/1923#issuecomment-3514875844

   @agoncharuk  Hello, I partially agree with your point. FlussLakehouseReader 
is necessary, but it should rely more on the existing Trino Lakehouse connector 
rather than reimplementing all functionality.
   
   However, when it comes to parallel split processing, our implementation 
could make better use of the parallelism built into the Lakehouse storage 
format.
   
   In FlussSplit, the unit of parallelism is a TableBucket, and 
FlussSplitManager generates splits from the table's bucket distribution.
   
   ```
   // FlussSplitManager.java
   // Get bucket count from table descriptor
   TableDescriptor tableDescriptor = tableInfo.getTableDescriptor();
   Optional<Integer> bucketCount = tableDescriptor.getDistribution().getBucketCount();
   
   if (bucketCount.isPresent()) {
       int numBuckets = bucketCount.get();
       // Create one split per bucket
       for (int bucketId : prunedBuckets) {
           // Skip bucket ids outside the valid range [0, numBuckets)
           if (bucketId >= 0 && bucketId < numBuckets) {
               TableBucket tableBucket = new TableBucket(tableInfo.getTableId(), bucketId);
               FlussSplit split = new FlussSplit(tablePath, tableBucket, addresses);
               splits.add(split);
           }
       }
   }
   ```
   
   As the code shows, the FlussSplit mechanism is based on table buckets. 
That suits parallel reading of real-time Fluss data, but it cannot fully 
exploit the finer-grained parallelism of the Lakehouse storage format. 
Lakehouse storage formats (such as Parquet, Paimon, and Iceberg) can be read 
in parallel at the level of files, row groups, etc., which can give higher 
resource utilization and better query performance.
   
   The Lakehouse approach has two significant advantages:
   
   - Strong dynamic adaptability: Lakehouse can adjust its parallelism based 
on file size and data distribution, whereas FlussSplit's parallelism is 
determined solely by the number of buckets in the table.
   - Higher resource utilization: Lakehouse can adjust its parallelism at 
query time based on data volume and available computing resources, whereas 
FlussSplit's parallelism is fixed at table creation and may leave computing 
resources underused.
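
To make the contrast concrete, here is a minimal sketch of the two planning styles. It is not code from this PR, from Fluss, or from Trino: `SplitPlanningSketch`, `DataFile`, `FileRangeSplit`, and the `targetSplitBytes` parameter are all hypothetical names invented for illustration, and real lakehouse connectors also take row-group boundaries, delete files, and similar details into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these types exist in Fluss or Trino.
final class SplitPlanningSketch {

    /** A data file in the lake, described only by its path and size. */
    record DataFile(String path, long sizeBytes) {}

    /** A byte range of one data file; stands in for a file-based split. */
    record FileRangeSplit(String path, long start, long length) {}

    /** Bucket-based planning: one split per bucket, independent of data volume. */
    static int planBucketSplits(int numBuckets) {
        return numBuckets;
    }

    /**
     * File-based planning: cut every file into ranges of at most
     * targetSplitBytes, so the split count grows with the data volume.
     */
    static List<FileRangeSplit> planFileSplits(List<DataFile> files, long targetSplitBytes) {
        if (targetSplitBytes <= 0) {
            throw new IllegalArgumentException("targetSplitBytes must be positive");
        }
        List<FileRangeSplit> splits = new ArrayList<>();
        for (DataFile file : files) {
            for (long start = 0; start < file.sizeBytes(); start += targetSplitBytes) {
                // The last range of a file may be shorter than the target size.
                long length = Math.min(targetSplitBytes, file.sizeBytes() - start);
                splits.add(new FileRangeSplit(file.path(), start, length));
            }
        }
        return splits;
    }
}
```

With a table of 4 buckets, `planBucketSplits(4)` always gives 4 units of parallelism, while `planFileSplits` yields more splits as files grow or as the target split size is lowered.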
   
   @wuchong @luoyuxia Do you have any suggestions on this?
   


