Hello,

We mainly read Parquet files from HDFS, and that works very well. But
when we try to read the same kind of file from S3, Spark creates only one
input partition/split per file, which severely limits parallelism. The
behavior is the same with both s3n and the newer s3a protocol.

As far as I understand, input splits are determined by the InputFormat, so I
can't see why this doesn't work with S3. Any pointers? Am I missing some configs?
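
To make my understanding concrete, here is a rough Python sketch of how I
believe Hadoop's generic FileInputFormat cuts a file into splits: the split
size is max(minSize, min(maxSize, blockSize)), and the last split may run up
to 10% over. The function names and default values are mine, for illustration
only, and the Parquet input format may apply its own row-group-aware logic on
top of this, so please treat it as an assumption rather than the actual code path.

```python
# Sketch of the generic FileInputFormat split arithmetic (not Spark/Parquet code).
# Function names and defaults are illustrative assumptions.

SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size, max_size):
    """Clamp the block size reported by the filesystem between min and max."""
    return max(min_size, min(max_size, block_size))


def split_lengths(file_len, block_size, min_size=1, max_size=2**63 - 1):
    """Return the byte length of each split for a single splittable file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    lengths = []
    remaining = file_len
    # Emit full-size splits while more than SPLIT_SLOP splits' worth remains.
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        lengths.append(remaining)
    return lengths


MIB = 1024 * 1024
# A 1 GiB file with a 128 MiB block size is cut into several splits,
# while a block size at least as large as the file yields a single split.
many = split_lengths(1024 * MIB, block_size=128 * MIB)
one = split_lengths(1024 * MIB, block_size=2048 * MIB)
print(len(many), len(one))
```

If this arithmetic applies here, the block size that the S3 filesystem reports
and the min/max split size settings would be the inputs to look at.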

Thanks!

-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888
