In Spark 1.4 and 1.5, you can do something like this:
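A minimal sketch of that approach, assuming the data is already in a DataFrame and using `DataFrameWriter.partitionBy`, which is available since Spark 1.4. The helper name `write_partitioned_parquet`, the column name `key`, and the names `sqlContext` and `rdd` in the commented usage are hypothetical; the output path is the one mentioned in the question below.

```python
def write_partitioned_parquet(df, key_column, output_dir):
    """Write df as Parquet, one sub-directory per distinct value of key_column.

    Assumes df is a Spark DataFrame (Spark 1.4+), whose .write attribute
    is a DataFrameWriter.
    """
    # partitionBy produces a layout of the form
    #   <output_dir>/<key_column>=<value>/part-*.parquet
    # in a single job, without looping over the data once per key.
    df.write.partitionBy(key_column).parquet(output_dir)


# Hypothetical usage, assuming an existing SQLContext named sqlContext
# and an RDD of Row objects named rdd that has a "key" field:
#
#   df = sqlContext.createDataFrame(rdd)
#   write_partitioned_parquet(df, "key", "/datasink/output-parquets")
```

Note that the directories are named `key=<value>` rather than just `<value>`, so the layout is close to, but not identical to, `/datasink/output-parquets/<key>/foobar.parquet`.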


BTW, I'm curious how you did it without partitionBy, using saveAsHadoopFile?


On 9/8/15 2:34 PM, Adrien Mogenet wrote:
Hi there,

We've spent several hours trying to split our input data into several Parquet files (or several folders, i.e. /datasink/output-parquets/<key>/foobar.parquet), based on a low-cardinality key. This works very well when using saveAsHadoopFile, but we can't achieve a similar thing with Parquet files.

The only working solution so far is to persist the RDD and then loop over it N times to write N files. That does not look acceptable...

Do you guys have any suggestions on how to do such an operation?


*Adrien Mogenet*
Head of Backend/Infrastructure
50, avenue Montaigne - 75008 Paris
