Hi,

Apologies in advance if this is a silly question.

I was comparing the various data formats as an exercise and noticed that the
Parquet output file is huge (2 KB) compared to Thrift (79 bytes), Protocol
Buffers (45 bytes) or Avro (333 bytes). At first I thought the issue was with
Spark, so I used a ThriftParquetWriter to write a single record (an
AvroParquetWriter yielded the same result; the tiny difference in bytes is
because of the schema part). The trouble is that I see 5 instances of the
data in the binary (hence the 2 KB size).

Could someone explain or point me to a link that explains why this is the
case?

This is Parquet 1.8.2, and I haven't tried any lower version.
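
In case it helps with the diagnosis, here is a rough sketch of dumping the
footer metadata (this assumes the parquet-hadoop 1.8.x
ParquetFileReader.readFooter API; dumpFooter is just an illustrative helper,
not something from the code below). The per-column min/max statistics
recorded in the footer are one place where the written values show up again
outside the data pages:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Prints the column-chunk metadata recorded in the Parquet footer,
// including sizes and min/max statistics.
def dumpFooter(path: String): Unit = {
  val footer = ParquetFileReader.readFooter(new Configuration(), new Path(path))
  footer.getBlocks.asScala.foreach { block =>
    block.getColumns.asScala.foreach { col =>
      println(s"column=${col.getPath} codec=${col.getCodec} " +
        s"totalSize=${col.getTotalSize} stats=${col.getStatistics}")
    }
  }
}

dumpFooter("serialized_parquet_file.parquet")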

*Binary: *
[image: image.png]

*Code (Thrift): *

import java.io.File

import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.metadata.CompressionCodecName
import org.apache.parquet.thrift.ThriftParquetWriter

// Writes a single Thrift record to an uncompressed Parquet file.
def serialize(t: TweetThrift, file: File): Unit = {
  val dataFileWriter = new ThriftParquetWriter[TweetThrift](
    new Path(file.getAbsolutePath),
    classOf[TweetThrift],
    CompressionCodecName.UNCOMPRESSED)
  dataFileWriter.write(t)
  dataFileWriter.close()
}

val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
tweet.setText("First tweet")
ParquetSerDe.serialize(tweet, file)


*Code (Avro): *

val file = new File("serialized_parquet_file.parquet")
val tweet = TweetAvro
  .newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("arunma")
  .setText("Parquet tweet")
  .build()

ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
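
For completeness, a rough sketch of what the Avro-side ParquetSerDe.serialize
might look like (assuming the plain AvroParquetWriter(Path, Schema)
constructor from parquet-avro 1.8.x, with the writer's default settings):

import java.io.File

import org.apache.avro.Schema
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter

// Sketch only: writes a single Avro record to a Parquet file using the
// supplied schema and the writer's default settings.
def serialize(t: TweetAvro, schema: Schema, file: File): Unit = {
  val writer = new AvroParquetWriter[TweetAvro](new Path(file.getAbsolutePath), schema)
  writer.write(t)
  writer.close()
}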



Cheers,

Arun
