Hello, we mainly read Parquet files from HDFS, and that works very well. But when we try to read the same kind of file from S3, Spark creates only one input partition/split per file, which severely limits parallelism. The behavior is the same with both s3n and the newer s3a protocol.
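For context, here is a rough sketch of how Hadoop's generic FileInputFormat (the `org.apache.hadoop.mapreduce` one) derives the number of splits from a file's length and the block size the filesystem reports. The function names `compute_split_size` and `count_splits`, the 1 GiB file, and the block sizes are only illustrative assumptions. Parquet's input format has its own split logic, so this may not be what actually happens in our case. It only shows why a reported block size that is at least as large as the file would give exactly one split.

```python
# Sketch of the split arithmetic in Hadoop's generic FileInputFormat.
# Illustration only: ParquetInputFormat may compute splits differently, and
# zero-length or non-splittable files are not handled here.

SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size, max_size):
    # Same shape as FileInputFormat.computeSplitSize:
    # max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))


def count_splits(file_length, block_size, min_size=1, max_size=2**63 - 1):
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = 0
    remaining = file_length
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the last split.
    if remaining > 0:
        splits += 1
    return splits


MB = 1024 * 1024

# Hypothetical 1 GiB file with a 128 MB reported block size.
hdfs_like = count_splits(1024 * MB, block_size=128 * MB)

# Same file, but the filesystem reports a block size larger than the file.
large_block = count_splits(1024 * MB, block_size=2048 * MB)

print(hdfs_like, large_block)
```

If something like this applies, the block size reported for S3 paths and the min/max split size settings would be the first things to compare between the HDFS and S3 runs.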
It seems to me that the input split is determined by the InputFormat, so I can't see why this doesn't work with S3. Any pointers? Am I missing some configuration? Thanks!

--
*JU Han*
Software Engineer @ Teads.tv
+33 0619608888
