Hi,
Apologies in advance if this is a silly question.
I was comparing the various data formats as an exercise and noticed that
the size of a Parquet output file is huge (2 KB) compared to Thrift (79
bytes), Protocol Buffers (45 bytes) or Avro (333 bytes). At first I thought
the issue was with Spark, so I used a ThriftParquetWriter directly to write
a single record (AvroParquetWriter yielded the same result; the tiny
difference in bytes is because of the schema part). The trouble is that I
see 5 instances of the data in the binary, hence the 2 KB size.
Could someone explain or point me to a link that explains why this is the
case?
This is Parquet 1.8.2, and I haven't tried any lower version.
*Binary:*
[image: image.png]
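In case it helps, the footer can be dumped with something along these lines
to see what is stored where (a rough sketch against parquet-hadoop 1.8.x's
ParquetFileReader.readFooter; parquet-tools meta on the same file prints much
the same information):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

def dumpFooter(path: String): Unit = {
  // The footer carries the Parquet schema, the key/value metadata (which is
  // typically where the original Thrift/Avro schema ends up) and the
  // per-row-group column chunk metadata.
  val footer = ParquetFileReader.readFooter(new Configuration(), new Path(path))
  println(footer.getFileMetaData.getSchema)
  println(footer.getFileMetaData.getKeyValueMetaData)
  footer.getBlocks.asScala.foreach { block =>
    println(s"row group: ${block.getRowCount} rows, ${block.getTotalByteSize} total bytes")
    block.getColumns.asScala.foreach { col =>
      println(s"  ${col.getPath}: ${col.getTotalSize} bytes on disk")
    }
  }
}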
*Code (Thrift):*
import java.io.File
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.hadoop.thrift.ThriftParquetWriter

def serialize(t: TweetThrift, file: File) = {
  // Single record, no compression, default block/page sizes
  val dataFileWriter = new ThriftParquetWriter[TweetThrift](
    new Path(file.getAbsolutePath),
    classOf[TweetThrift],
    CompressionCodecName.UNCOMPRESSED)
  dataFileWriter.write(t)
  dataFileWriter.close()
}
val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
tweet.setText("First tweet")
ParquetSerDe.serialize(tweet, file)
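The sizes quoted above are simply the on-disk lengths of the written files,
checked with something like:

println(s"${file.getName}: ${file.length()} bytes")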
*Code (Avro):*
val file = new File("serialized_parquet_file.parquet")
val tweet = TweetAvro.newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("arunma")
  .setText("Parquet tweet")
  .build()
ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
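For completeness, the Avro version of ParquetSerDe.serialize (not shown
above) is essentially the same as the Thrift one, just built on
AvroParquetWriter, roughly:

import java.io.File
import org.apache.avro.Schema
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

def serialize(record: TweetAvro, schema: Schema, file: File) = {
  // Two-argument constructor: default block/page sizes, no compression
  val writer = new AvroParquetWriter[TweetAvro](new Path(file.getAbsolutePath), schema)
  writer.write(record)
  writer.close()
}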
Cheers,
Arun