hi Arun, Parquet isn't designed for efficient transport of small bits of data, e.g. 1 record at a time. It's designed to compactly store large analytics datasets, where dictionary encoding, run-length encoding, and compression are effective at reducing space.
In your example, there are several additional pieces of data stored:

* File footer metadata
* Row group and column chunk metadata, including "column" statistics (min and max values)
* Data page headers, one for each "column"

If you compared Avro- or Protobuf-based storage of e.g. a 1-million-record dataset as a single file, I would bet that Parquet would be smaller (possibly significantly smaller, 5-10x or more) in typical use cases.

- Wes

On Tue, Feb 26, 2019 at 8:14 AM Arun Manivannan <[email protected]> wrote:

> Hi,
>
> Apologies in advance if this is a silly question.
>
> I was trying to compare the various data formats as an exercise and I
> noticed that the size of a Parquet output file is huge (2 KB) compared to
> Thrift (79 bytes), PB (45 bytes) or Avro (333 bytes). I thought the issue
> was with Spark, and then I used a ThriftParquetWriter to write a single
> record (AvroParquetWriter yielded the same result too; the tiny difference
> in bytes is because of the schema part). The trouble is that I see 5
> instances of the data in the binary (hence the 2 KB size).
>
> Could someone explain or point me to a link that explains why this is the
> case?
>
> This is Parquet 1.8.2 and I haven't tried any lower version.
> *Binary:*
> [image: image.png]
>
> *Code (Thrift):*
>
> def serialize(t: TweetThrift, file: File) = {
>   val dataFileWriter = new ThriftParquetWriter[TweetThrift](
>     new Path(file.getAbsolutePath),
>     classOf[TweetThrift],
>     CompressionCodecName.UNCOMPRESSED)
>   dataFileWriter.write(t)
>   dataFileWriter.close()
> }
>
> val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
> tweet.setText("First tweet")
> ParquetSerDe.serialize(tweet, file)
>
> *Code (Avro):*
>
> val file = new File("serialized_parquet_file.parquet")
> val tweet = TweetAvro
>   .newBuilder
>   .setTarget(1)
>   .setId(123)
>   .setDate("Saturday 8th, June")
>   .setUser("arunma")
>   .setText("Parquet tweet")
>   .build()
>
> ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
>
> Cheers,
> Arun
