pengxianzi commented on issue #12585: URL: https://github.com/apache/hudi/issues/12585#issuecomment-2574262951
> > Warning: the reader is lagging too far behind the writer; tune the 'read.tasks' option to increase the parallelism of the read tasks
>
> This log indicates that the streaming reader is lagging too far behind. There are two possible reasons:
>
> 1. Kudu itself writes slowly, so input splits pile up in the operator state and cause the lag;
> 2. I see that an explicit read start commit is declared. Is it a long-past historical commit? A start commit earlier than the first active commit on the active timeline triggers a full table scan and filtering.

Thank you for your response! We understand the two potential reasons for the streaming read task lag:

1. Slow Kudu write speed: Our testing shows that regular MOR tables work fine with Kudu, but bucketed tables cause read task backlogs. We suspect the issue is related to the characteristics of bucketed tables.
2. Full table scan due to an early read start commit: We specified an early read start commit in order to read historical data. If we don’t specify it, or set it later than the first active commit, how can we ensure that historical data is still read? Are there ways to avoid a full table scan?

Further Questions

1. Impact of bucketed tables: Does a bucketed table have a specific impact on read performance? Are there any optimization suggestions?
2. Historical data reading: How can we read historical data without triggering a full table scan? Are there incremental read configurations?
3. Kudu write performance: Does the data distribution of bucketed tables affect Kudu write performance?

Summary

We would like to know:

1. How to optimize read performance for bucketed tables.
2. How to read historical data without a full table scan.
3. Whether there are other configurations or tools that can help troubleshoot.

Thank you for your help!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
