stevenzwu commented on a change in pull request #3817:
URL: https://github.com/apache/iceberg/pull/3817#discussion_r779019623
##########
File path: flink/v1.14/flink/src/main/java/org/apache/iceberg/flink/source/FlinkSplitGenerator.java
##########
@@ -37,9 +40,20 @@ private FlinkSplitGenerator() {
   static FlinkInputSplit[] createInputSplits(Table table, ScanContext context) {
     List<CombinedScanTask> tasks = tasks(table, context);
     FlinkInputSplit[] splits = new FlinkInputSplit[tasks.size()];
-    for (int i = 0; i < tasks.size(); i++) {
-      splits[i] = new FlinkInputSplit(i, tasks.get(i));
-    }
+    boolean localityPreferred = context.locality();
+
+    Tasks.range(tasks.size())
+        .stopOnFailure()
+        .executeWith(localityPreferred ? ThreadPools.getWorkerPool() : null)
Review comment:
This avoids the thread pool when locality is disabled, which is better. But I want to double-check whether we need a thread pool here at all. I see that `Util.blockLocations` eventually calls `getFileStatus`, which is likely a network call to the Hadoop name node, so a thread pool does make sense to me. On the other hand, the current caller of `Util.blockLocations` in the iceberg-mr module, `IcebergSplit.getLocations()`, doesn't use a thread pool. Looping in @rdsr and @rdblue, who might have more context here.
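For context, a minimal sketch of the `executeWith` semantics in Iceberg's `Tasks` helper (variable names are reused from the hunk above; `createSplit` is an illustrative stand-in, not the PR's code):
```
// With a null executor, Tasks.range(...).run(...) executes each index
// inline on the calling thread, so split generation stays single-threaded
// when locality is disabled. With the shared worker pool, the per-index
// bodies run concurrently, which only pays off if each body makes a
// blocking name-node call (getFileStatus) to resolve block locations.
Tasks.range(tasks.size())
    .stopOnFailure()
    .executeWith(localityPreferred ? ThreadPools.getWorkerPool() : null)
    .run(index -> splits[index] = createSplit(index));
```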
@hililiwei it might be simpler to call this API from the `Util` class:
```
public static String[] blockLocations(FileIO io, CombinedScanTask task)
```
This is how the Spark code calls it:
```
if (localityPreferred) {
  Table table = tableBroadcast.value();
  this.preferredLocations = Util.blockLocations(table.io(), task);
} else {
  this.preferredLocations = HadoopInputFile.NO_LOCATION_PREFERENCE;
}
```
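Translated to this PR, the per-task body might look roughly like the following (a sketch only; the three-argument `FlinkInputSplit` constructor and the null fallback are assumptions for illustration):
```
CombinedScanTask task = tasks.get(index);
String[] hostnames = localityPreferred
    ? Util.blockLocations(table.io(), task)  // may block on a name-node call
    : null;                                  // assumed to mean "no preference"
splits[index] = new FlinkInputSplit(index, task, hostnames);
```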