Why do the HDFS Parquet files generated by Spark SQL have a different size from those on Tachyon?

2015-04-17 Thread zhangxiongfei
Hi,
I ran some tests on Parquet files with the Spark SQL DataFrame API.
I generated 36 gzip-compressed Parquet files with Spark SQL and stored them on
Tachyon; each file is about 222 MB. Then I read them with the code below:

val tfs = sqlContext.parquetFile("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")

Next, I saved this DataFrame to HDFS with the code below. It also generates 36
Parquet files, but each one is about 265 MB:

tfs.repartition(36).saveAsParquetFile("/user/zhangxf/adClick-parquet-tachyon")

My question is: why do the files on HDFS have a different size from those on
Tachyon, even though they come from the same original data?
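
For reference, the per-file sizes on the two stores can be compared like this
(a minimal sketch using the Hadoop FileSystem API; it assumes the Tachyon
Hadoop client is on the classpath so that tachyon:// URIs resolve):

// Minimal sketch: list per-file sizes under both output paths via the
// Hadoop FileSystem API. Paths are the ones from this mail.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def printSizes(uri: String): Unit = {
  val path = new Path(uri)
  val fs   = path.getFileSystem(new Configuration())
  fs.listStatus(path).foreach { s =>
    println(f"${s.getPath.getName}%s  ${s.getLen / 1024.0 / 1024.0}%.1f MB")
  }
}

printSizes("tachyon://datanode8.bitauto.dmp:19998/apps/tachyon/adClick")
printSizes("/user/zhangxf/adClick-parquet-tachyon")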


Thanks
Zhang Xiongfei



Re: Why do the HDFS Parquet files generated by Spark SQL have a different size from those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition, which rearranges all the data.

Parquet uses all kinds of compression techniques, such as dictionary
encoding and run-length encoding, which result in a size difference
when the data is ordered differently.
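
As a toy illustration (hypothetical data and paths, not your dataset),
writing the same rows sorted versus repartitioned shows the effect:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Toy example: a million rows with a low-cardinality "key" column.
val sc = new SparkContext(new SparkConf().setAppName("parquet-order-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 1000000).map(i => (i % 10, i)).toDF("key", "value")

// Sorted by key: long runs of identical values, so run-length and
// dictionary encoding keep the pages small.
df.sort("key").saveAsParquetFile("/tmp/adClick-sorted")

// Repartitioned: the same rows interleaved across partitions, so the
// encodings are less effective and the files come out larger.
df.repartition(4).saveAsParquetFile("/tmp/adClick-shuffled")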
