In Spark 1.4 and 1.5, you can do something like this:

df.write.partitionBy("key").parquet("/datasink/output-parquets")
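
A slightly fuller sketch, in case it helps (Scala, assuming a sqlContext as in spark-shell; the input path is just illustrative):

// Read the input and write it out partitioned by the low-cardinality column "key".
val df = sqlContext.read.parquet("/datasink/input-parquets")
df.write.partitionBy("key").parquet("/datasink/output-parquets")

// The output is laid out as /datasink/output-parquets/key=<value>/part-*.parquet,
// and partition discovery restores "key" as a column when reading it back:
val partitioned = sqlContext.read.parquet("/datasink/output-parquets")
partitioned.filter(partitioned("key") === "some-value").show()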

BTW, I'm curious: how did you do it without partitionBy, using saveAsHadoopFile?

Cheng

On 9/8/15 2:34 PM, Adrien Mogenet wrote:
Hi there,

We've spent several hours trying to split our input data into several parquet files (or several folders, i.e. /datasink/output-parquets/<key>/foobar.parquet), based on a low-cardinality key. This works very well when using saveAsHadoopFile, but we can't achieve a similar thing with Parquet files.

The only working solution so far is to persist the RDD and then loop over it N times to write N files. That does not look acceptable...
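
Roughly, the workaround looks like the sketch below (column and path names are illustrative, not our actual code):

// Persist once, then do one filter + write per distinct key value.
val df = sqlContext.read.parquet("/datasink/input-parquets")
df.persist()

val keys = df.select("key").distinct().collect().map(_.getString(0))
keys.foreach { k =>
  df.filter(df("key") === k).write.parquet(s"/datasink/output-parquets/$k")
}

df.unpersist()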

Do you guys have any suggestions for doing such an operation?

--

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
