WinkerDu opened a new pull request #3073: URL: https://github.com/apache/iceberg/pull/3073
In our scenario, we use the V2 format to support streaming CDC row-level inserts and deletes. A scan task produced by a small-file rewrite action or a query scan (e.g. a typical Spark batch scan) can end up with a very large number of delete files to apply.

The existing scan-task bin-packing logic is based only on data file size: a scan task contains data files whose total size satisfies a given target size. This works fine for the V1 format, where a task deals only with data files. For the V2 format, however, the cost of a scan task also includes applying delete files. Suppose the bin-packing target size is 128 MB and the data files are small due to streaming CDC updates, say 1 MB each, with each data file needing to apply 128 valid eq-delete / pos-delete files. The total number of delete files applied by a single task can then reach 128 * 128 = 16,384; even with enough CPU cores to run all tasks in parallel, the scan performance of that one task cannot be improved.

This PR introduces a new configuration that limits the number of items in a bin during bin-packing iteration (default: Integer.MAX_VALUE). Setting it lets us control task size and improve overall scan performance by fully utilizing compute resources.
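Below is a minimal sketch of what count-bounded bin-packing looks like, under the assumption that a bin is closed when either the size target or a per-bin item limit is reached. The class name `CountBoundedPacking`, the method `pack`, and the parameter `maxItemsPerBin` are hypothetical names for illustration only; they are not the actual Iceberg API or the exact code in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size- and count-bounded bin-packing; not the
// actual Iceberg implementation, only an illustration of the idea behind
// the new configuration.
public class CountBoundedPacking {

  // Packs file sizes into bins. A bin is closed once adding the next file
  // would exceed targetSizeBytes, OR it already holds maxItemsPerBin files.
  // With maxItemsPerBin = Integer.MAX_VALUE this reduces to the existing
  // size-only behavior.
  static List<List<Long>> pack(List<Long> fileSizes, long targetSizeBytes, int maxItemsPerBin) {
    List<List<Long>> bins = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentSize = 0L;

    for (long size : fileSizes) {
      boolean sizeFull = !current.isEmpty() && currentSize + size > targetSizeBytes;
      boolean countFull = current.size() >= maxItemsPerBin;
      if (sizeFull || countFull) {
        bins.add(current);
        current = new ArrayList<>();
        currentSize = 0L;
      }
      current.add(size);
      currentSize += size;
    }
    if (!current.isEmpty()) {
      bins.add(current);
    }
    return bins;
  }

  public static void main(String[] args) {
    // 256 small files of 1 MB each against a 128 MB target: without a count
    // limit this yields 2 large tasks; capping at 16 files per bin yields 16
    // smaller tasks that can be spread across more cores.
    List<Long> sizes = new ArrayList<>();
    for (int i = 0; i < 256; i++) {
      sizes.add(1L * 1024 * 1024);
    }
    long target = 128L * 1024 * 1024;
    System.out.println(pack(sizes, target, Integer.MAX_VALUE).size()); // 2
    System.out.println(pack(sizes, target, 16).size());                // 16
  }
}
```

The exact property key the PR adds and where it is read (table properties vs. scan options) is defined in the PR diff itself; the sketch above only shows the packing behavior the limit is meant to produce.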
