(I posted this on Slack originally) Hey folks, I’m writing a batch connector for an in-house data lake and doing some performance work now… I’ve noticed that my ScanBuilder creates a Scan exactly once, but the Scan’s toBatch method is called three times, returning the identical object every time. The batch’s planInputPartitions method is then called twice, doing a large amount of redundant work. I’m targeting Spark 3.3.2 currently because EMR doesn’t support Spark 3.4.x yet.
This is all on a single node in local mode. planInputPartitions() is itself a somewhat expensive operation, so I’d rather not have it called twice. I haven’t implemented SupportsRuntimeFiltering yet, but I’m not confident it would help with this specific problem. The javadoc for planInputPartitions says it’ll "be called only once, to launch one Spark job". OTOH, https://github.com/vertica/spark-connector/issues/171#issuecomment-1051162865 says it’s normal for it to be called twice. Well, at least it’s called on the same instance both times, so I can just cache the results, I guess… annoying though. Is there a well-known better way to avoid this inefficiency? Is it a bug? Thanks! -0xe1a
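
For reference, the caching workaround mentioned above could look roughly like the sketch below. It is not the connector's actual code. `PartitionPlanner` is a hypothetical stand-in for Spark's `Batch` interface so that the example runs without Spark on the classpath; the real `planInputPartitions()` returns `InputPartition[]` rather than a `List`. `MemoizingPlanner` and `CachedPlanning` are likewise made-up names. The idea relies only on the observation that both calls arrive on the same instance.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class CachedPlanning {

    /** Hypothetical stand-in for Spark's Batch (only the method that matters here). */
    interface PartitionPlanner<P> {
        List<P> planInputPartitions();
    }

    /** Runs the wrapped planner at most once per instance and reuses the result. */
    static final class MemoizingPlanner<P> implements PartitionPlanner<P> {
        private final PartitionPlanner<P> delegate;
        private List<P> cached; // stays null until the first call

        MemoizingPlanner(PartitionPlanner<P> delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized List<P> planInputPartitions() {
            if (cached == null) {
                // The expensive work happens only on the first call.
                cached = Collections.unmodifiableList(delegate.planInputPartitions());
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated expensive planning step that counts how often it runs.
        PartitionPlanner<String> expensive = () -> {
            calls.incrementAndGet();
            return Arrays.asList("part-0", "part-1");
        };

        PartitionPlanner<String> planner = new MemoizingPlanner<>(expensive);
        planner.planInputPartitions(); // first call: does the work
        planner.planInputPartitions(); // second call: served from the cache

        System.out.println("expensive planning ran " + calls.get() + " time(s)");
    }
}
```

In a real connector the same pattern would presumably live directly inside the `Batch` implementation, with the cached field holding the `InputPartition[]`. This only helps as long as the same `Batch` instance is reused across calls, which matches what is described above but is not something this sketch verifies against Spark itself.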