tustvold commented on PR #5057: URL: https://github.com/apache/arrow-datafusion/pull/5057#issuecomment-1404027622
Only had time to take a brief look at this PR, so I'm likely missing something, but please bear with me :smile:

This PR modifies `ListingTable` to pair `PartitionedFile` with `Vec<Option<FileRange>>`. This makes the approach specific to `ListingTable`, and it also adds parallelism control to a part of the system that doesn't really have context on how much parallelism is needed, nor on what invariants, such as sort orders, may need to be upheld.

I have two suggestions that may be stupid:

* Make this a physical optimizer rule that looks at operators containing `FileScanConfig` and adds more partitions based on the `target_partitions` property
* Rather than adding a new `FileRanges` property, use the existing `range: Option<FileRange>` already present on `PartitionedFile`; the same file with disjoint ranges can then appear in multiple partitions
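
To make the second suggestion concrete, here is a rough sketch of how one file could show up in several partitions with disjoint ranges. The `FileRange` and `PartitionedFile` structs are simplified stand-ins I made up for illustration, not the real DataFusion types, and `repartition_by_size` is a hypothetical helper. It assumes the input files have no range set yet, and it ignores sort orders entirely, so a real optimizer rule would have to check that splitting is safe first.

```rust
// Simplified stand-ins for illustration only; the real DataFusion types
// carry more fields and may use different integer types.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    start: u64, // inclusive byte offset
    end: u64,   // exclusive byte offset
}

#[derive(Debug, Clone, PartialEq)]
struct PartitionedFile {
    path: String,
    size: u64,
    range: Option<FileRange>,
}

/// Hypothetical helper: spread the bytes of `files` over at most
/// `target_partitions` groups of roughly equal size. A file that crosses
/// a group boundary is emitted once per group, each time with a disjoint
/// `range`. Any `range` already set on the input is ignored (assumption),
/// and zero-length files are dropped.
fn repartition_by_size(
    files: &[PartitionedFile],
    target_partitions: usize,
) -> Vec<Vec<PartitionedFile>> {
    let total: u64 = files.iter().map(|f| f.size).sum();
    if target_partitions == 0 || total == 0 {
        // Nothing sensible to split; keep everything in one group.
        return vec![files.to_vec()];
    }

    // Bytes per partition, rounded up so we never exceed the target count.
    let n = target_partitions as u64;
    let chunk = (total + n - 1) / n;

    let mut partitions = Vec::new();
    let mut current = Vec::new();
    let mut remaining = chunk; // bytes still free in the current partition

    for file in files {
        let mut offset = 0;
        while offset < file.size {
            let take = remaining.min(file.size - offset);
            current.push(PartitionedFile {
                path: file.path.clone(),
                size: file.size,
                range: Some(FileRange { start: offset, end: offset + take }),
            });
            offset += take;
            remaining -= take;
            if remaining == 0 {
                // Current partition is full; start a new one.
                partitions.push(std::mem::take(&mut current));
                remaining = chunk;
            }
        }
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}
```

With a 100-byte file and a 50-byte file and a target of 3, this puts the first file into two partitions (bytes 0..50 and 50..100) and the second file into a third. Splitting purely on byte offsets only works if the reader can align a range to a valid record or row-group boundary, which is another reason the decision probably belongs where the file format is known.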
