Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread zhangxiongfei

Hi,
I ran some tests on Parquet files with the Spark SQL DataFrame API.
I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on
Tachyon. Each file is about 222 MB. I then read them with the code below:
val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")
Next, I saved this DataFrame to HDFS with the code below. It also generates 36
Parquet files, but each file is about 265 MB:
tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon")
My question is: why do the files on HDFS differ in size from those on Tachyon,
even though they come from the same original data?


Thanks
Zhang Xiongfei

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin

It's because you did a repartition -- which rearranges all the data.

Parquet uses several encoding techniques, such as dictionary encoding and
run-length encoding, whose output size depends on the order of the data, so
the same rows written in a different order produce files of a different size.
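
To make the ordering effect concrete, here is a minimal sketch in plain Scala
(no Spark needed). It is a toy run-length counter, not Parquet's actual
encoder, and the column values, row counts, and the name RunLengthDemo are all
made up for illustration. It only shows that the same values need far more
runs once their order is scrambled, which is roughly what a repartition does
to each output file.

```scala
// Toy illustration only: this is NOT Parquet's real encoder.
object RunLengthDemo {
  // Number of (value, runLength) pairs a simple run-length encoder would emit:
  // one run for the first element, plus one more at every position where the
  // value differs from its predecessor.
  def runCount[A](values: Seq[A]): Int =
    if (values.isEmpty) 0
    else 1 + values.zip(values.tail).count { case (a, b) => a != b }

  def main(args: Array[String]): Unit = {
    // Hypothetical column: 4 distinct values, 1000 rows each, grouped by value.
    val clustered: Seq[String] =
      Seq("beijing", "shanghai", "guangzhou", "shenzhen")
        .flatMap(city => Seq.fill(1000)(city))

    // The same rows in a scrambled order (fixed seed so runs are repeatable).
    val shuffled: Seq[String] = new scala.util.Random(42).shuffle(clustered)

    println(s"rows:                ${clustered.size}")
    println(s"runs when clustered: ${runCount(clustered)}")
    println(s"runs when shuffled:  ${runCount(shuffled)}")
  }
}
```

The clustered column collapses to 4 runs, while the shuffled one needs
thousands. Real Parquet files combine several encodings with a general-purpose
codec such as gzip, so the actual size change will differ, but the direction
of the effect is the same.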

