-------- Forwarded message --------
From: "zhangliyun" <kelly...@126.com>
Date: 2019-12-03 05:56:55
To: "Wenchen Fan" <cloud0...@gmail.com>
Subject: Re: Re: A question about rdd bytes size
Hi Fan,
Thanks for the reply. I agree that how the data is stored decides the total byte
size of the table file.
In my experiment, I found that:
- a SequenceFile with gzip compression is 0.5x of the total byte size calculated
  in memory;
- a Parquet file with LZO compression is 0.2x of the total byte size calculated
  in memory.
Is the reason the actual Hive table size is less than the total size calculated
in memory decided by the file format (SequenceFile, ORC, Parquet and others), by
the compression algorithm, or by both?
Meanwhile, can I directly use org.apache.spark.util.SizeEstimator.estimate(RDD)
to estimate the total size of an RDD? I guess there is some difference between
the actual size and the estimated size, so in which cases can we use it and in
which cases can we not?
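
My understanding is that SizeEstimator measures the in-memory footprint of a JVM
object graph, so calling it on the RDD handle itself only measures the small
driver-side RDD object, not the distributed rows. A minimal sketch of applying it
per row instead (the helper name here is only for illustration):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.util.SizeEstimator

    // Estimate each Row's in-memory size on the executors and sum the results.
    def estimateRowsInMemory(df: DataFrame): Long =
      df.rdd.map(row => SizeEstimator.estimate(row)).fold(0L)(_ + _)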
Best Regards
Kelly Zhang
On 2019-12-02 15:54:19, "Wenchen Fan" <cloud0...@gmail.com> wrote:
When we talk about byte size, we need to specify how the data is stored. For
example, if we cache the DataFrame, then the byte size is the number of bytes of
the binary format of the table cache. If we write to Hive tables, then the byte
size is the total size of the data files of the table.
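
For example, a minimal sketch of looking at both numbers (the table name db.tbl
is a placeholder, and Spark 2.3+ APIs are assumed):

    // Placeholder table name; replace with the real one.
    val df = spark.table("db.tbl")

    // On-disk size: the total size of the table's data files, recorded in the
    // catalog as spark.sql.statistics.totalSize after computing statistics.
    spark.sql("ANALYZE TABLE db.tbl COMPUTE STATISTICS")
    spark.sql("DESCRIBE TABLE EXTENDED db.tbl").show(100, truncate = false)

    // In-memory size: the optimizer's size estimate for the plan; once the
    // DataFrame is cached and materialized it reflects the columnar cache.
    df.cache().count()
    val inMemoryBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes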
On Mon, Dec 2, 2019 at 1:06 PM zhangliyun <kelly...@126.com> wrote:
Hi,
I want to get the total byte size of a DataFrame with the following function, but
when I insert the DataFrame into Hive, I find that the value returned by the
function is different from spark.sql.statistics.totalSize. The
spark.sql.statistics.totalSize is less than the result of the following function
getRDDBytes.
import org.apache.spark.sql.DataFrame

def getRDDBytes(df: DataFrame): Long = {
  df.rdd.getNumPartitions match {
    case 0 =>
      0L
    case _ =>
      // Sum the UTF-8 length of each row's string representation.
      val rddOfDataframe = df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
      if (rddOfDataframe.isEmpty()) 0L else rddOfDataframe.reduce(_ + _)
  }
}
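
For reference, _.toString() above measures the UTF-8 length of each row's string
rendering rather than the bytes actually written to the table files, so it is a
different quantity from spark.sql.statistics.totalSize. A rough alternative, just
as an illustrative sketch, is the optimizer's own plan statistics (which, when
catalog statistics exist for a Hive table, are derived from them):

    import org.apache.spark.sql.DataFrame

    // The size estimate Spark's optimizer assigns to the DataFrame's plan.
    def planSizeInBytes(df: DataFrame): BigInt =
      df.queryExecution.optimizedPlan.stats.sizeInBytes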
I would appreciate it if you could provide your suggestions.
Best Regards
Kelly Zhang