Input split on s3n/s3a

Han JU Wed, 19 Aug 2015 04:39:07 -0700

Hello,

We mainly read Parquet files from HDFS, and that works very well. But
when we try to read the same kind of file from S3, Spark creates only
one input partition/split per file, which severely limits parallelism.
The behavior is the same for both s3n and the newer s3a protocol.


As far as I can tell, the input split is determined by the InputFormat, so
I can't see why it doesn't work with S3. Any pointers? Am I missing some
configuration?
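
For context, here is a minimal Python sketch of the generic Hadoop
`FileInputFormat` split arithmetic (split size = max(minSize, min(maxSize,
blockSize)), with a 1.1 slop factor on the last split). The function names
and the file and block sizes are illustrative assumptions, and Parquet's own
input format may apply different logic based on row groups, so this only
shows why the block size reported by the filesystem matters.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Mirror of the generic rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def count_splits(file_length: int, split_size: int) -> int:
    """Count the splits a splittable file of file_length bytes would yield."""
    splits = 0
    remaining = file_length
    # Carve off full-size splits while clearly more than one split remains.
    while remaining / split_size > SPLIT_SLOP:
        remaining -= split_size
        splits += 1
    # Whatever is left (if anything) becomes the final split.
    if remaining != 0:
        splits += 1
    return splits


MIB = 1024 * 1024
file_length = 1024 * MIB  # hypothetical 1 GiB file

# Hypothetical case 1: the filesystem reports a 64 MiB block size.
small_block = compute_split_size(64 * MIB, 1, 2**63 - 1)
# Hypothetical case 2: the reported block size is at least the file size.
large_block = compute_split_size(2048 * MIB, 1, 2**63 - 1)

print(count_splits(file_length, small_block))
print(count_splits(file_length, large_block))
```

Under these assumptions, a reported block size at least as large as the file
produces a single split per file, which would match the behavior described
above.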

Thanks!

-- 
*JU Han*

Software Engineer @ Teads.tv

+33 0619608888
