kaori-seasons commented on PR #1923:
URL: https://github.com/apache/fluss/pull/1923#issuecomment-3514875844
@agoncharuk Hello, I partially agree with your point. FlussLakehouseReader
is necessary, but it should rely more on the existing Trino Lakehouse connector
rather than reimplementing all of its functionality.
For parallel split processing, however, I think the implementation can make
better use of the parallelism mechanism built into the Lakehouse storage
format.
FlussSplit uses TableBucket as its unit of parallelism, and FlussSplitManager
generates splits from the table's bucket distribution:
```
// FlussSplitManager.java (excerpt)
// Get the bucket count from the table descriptor
TableDescriptor tableDescriptor = tableInfo.getTableDescriptor();
Optional<Integer> bucketCount = tableDescriptor.getDistribution().getBucketCount();
if (bucketCount.isPresent()) {
    int numBuckets = bucketCount.get();
    // Create one split per bucket that survived pruning
    for (int bucketId : prunedBuckets) {
        if (bucketId >= 0 && bucketId < numBuckets) {
            TableBucket tableBucket = new TableBucket(tableInfo.getTableId(), bucketId);
            FlussSplit split = new FlussSplit(tablePath, tableBucket, addresses);
            splits.add(split);
        }
    }
}
```
As the code shows, FlussSplit is bucket-based. That suits parallel reads of
real-time Fluss data, but it cannot exploit the finer-grained parallelism of
the Lakehouse storage format. Lakehouse storage formats (such as Parquet,
Paimon, and Iceberg) can parallelize by file, row group, and so on, which
gives higher resource utilization and better query performance.
Lakehouse has two significant advantages:
- Strong dynamic adaptability: Lakehouse can dynamically adjust its
  parallelism based on file size and data distribution. FlussSplit's
  parallelism is fixed, determined by the number of buckets in the table.
- Higher resource utilization: Lakehouse can dynamically adjust its
  parallelism during queries based on data volume and computing resources.
  FlussSplit's parallelism is fixed at table creation, which may not fully
  utilize computing resources.
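To make the contrast concrete, here is a minimal sketch of size-based split
planning. It is not the Trino, Paimon, or Iceberg API: `SplitPlanningSketch`,
`LakeFile`, `FileRangeSplit`, and `targetSplitBytes` are hypothetical names
invented for illustration, and real readers would typically align ranges to
row-group boundaries rather than cut at raw byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPlanningSketch {

    /** Hypothetical descriptor of one data file in the lake table. */
    static final class LakeFile {
        final String path;
        final long sizeBytes;

        LakeFile(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Hypothetical split covering a byte range of a single file. */
    static final class FileRangeSplit {
        final String path;
        final long offset;
        final long length;

        FileRangeSplit(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Cuts every file into ranges of at most targetSplitBytes, so the number
     * of splits follows the data volume instead of a fixed bucket count.
     */
    static List<FileRangeSplit> planFileSplits(List<LakeFile> files, long targetSplitBytes) {
        if (targetSplitBytes <= 0) {
            throw new IllegalArgumentException("targetSplitBytes must be positive");
        }
        List<FileRangeSplit> splits = new ArrayList<>();
        for (LakeFile file : files) {
            for (long offset = 0; offset < file.sizeBytes; offset += targetSplitBytes) {
                long length = Math.min(targetSplitBytes, file.sizeBytes - offset);
                splits.add(new FileRangeSplit(file.path, offset, length));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed example values: a 4-bucket table whose lake data is two files.
        int numBuckets = 4;
        List<LakeFile> files = List.of(
                new LakeFile("part-0.parquet", 300L << 20),  // 300 MiB
                new LakeFile("part-1.parquet", 40L << 20));  // 40 MiB

        List<FileRangeSplit> splits = planFileSplits(files, 64L << 20); // 64 MiB target

        System.out.println("bucket-based splits: " + numBuckets);
        System.out.println("size-based splits:   " + splits.size());
    }
}
```

With bucket-based planning the split count stays at the bucket count no matter
how much data each bucket holds; in the sketch, lowering `targetSplitBytes` or
adding files changes the split count at query time.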
@wuchong @luoyuxia Do you have any suggestions on this?