Thanks a lot, Wes. That makes it very clear.
Just to convince myself, I also ran a parquet-tools dump for a single
column, "user", and cross-checked the six instances in the binary against
the dump. The first occurrence is the row-group-level statistics, the
second is the page-level statistics, and the last is the actual pair of
values.
The data used was:

val tweet1 = TweetAvro
  .newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("nus1")
  .setText("Parquet tweet1")
  .build()

val tweet2 = TweetAvro
  .newBuilder
  .setTarget(2)
  .setId(234)
  .setDate("Sunday 9th, June")
  .setUser("nus2")
  .setText("Parquet tweet2")
  .build()
parquet-tools dump -c user -n serialized_parquet_file.parquet

row group 0
--------------------------------------------------------------------------------
user:  BINARY UNCOMPRESSED DO:0 FPO:235 SZ:49/49/1.00 VC:2 ENC:PLAIN,BIT_PACKED ST:[*min: nus1, max: nus2*, num_nulls: 0]

    user TV=2 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[*min: nus1, max: nus2*, num_nulls: 0] SZ:16 VC:2

BINARY user
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 *V:nus1*
value 2: R:0 D:0 *V:nus2*
Thanks a ton, again.
Cheers,
Arun
On Tue, Feb 26, 2019 at 10:14 PM Arun Manivannan <[email protected]> wrote:
> Hi,
>
> Apologies in advance if this is a silly question.
>
> I was trying to compare the various data formats as an exercise and I
> noticed that the size of a Parquet output file is huge (2 KB) compared to
> Thrift (79 bytes), PB (45 bytes) or Avro (333 bytes). I thought the issue
> was with Spark, so I then used a ThriftParquetWriter to write a single
> record (AvroParquetWriter yielded the same result too; the tiny difference
> in bytes is because of the schema part). The trouble is that I see 5
> instances of the data in the binary (hence the 2 KB size).
>
> Could someone explain or point me to a link that explains why this is the
> case?
>
> This is Parquet 1.8.2, and I haven't tried any older version.
>
> Binary:
> [image: image.png]
>
> Code (Thrift):
>
> def serialize(t: TweetThrift, file: File) = {
>   val dataFileWriter = new ThriftParquetWriter[TweetThrift](
>     new Path(file.getAbsolutePath),
>     classOf[TweetThrift],
>     CompressionCodecName.UNCOMPRESSED)
>   dataFileWriter.write(t)
>   dataFileWriter.close()
> }
>
> val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
> tweet.setText("First tweet")
> ParquetSerDe.serialize(tweet, file)
>
>
> Code (Avro):
>
> val file = new File("serialized_parquet_file.parquet")
> val tweet = TweetAvro
>   .newBuilder
>   .setTarget(1)
>   .setId(123)
>   .setDate("Saturday 8th, June")
>   .setUser("arunma")
>   .setText("Parquet tweet")
>   .build()
>
> ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
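>
> A minimal sketch of what a ParquetSerDe.serialize along these lines could
> look like for the Avro case, assuming parquet-avro 1.8.x's
> AvroParquetWriter builder (illustrative only, not necessarily the exact
> helper used above):
>
> import java.io.File
>
> import org.apache.avro.Schema
> import org.apache.avro.generic.IndexedRecord
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.avro.AvroParquetWriter
> import org.apache.parquet.hadoop.metadata.CompressionCodecName
>
> def serialize[T <: IndexedRecord](record: T, schema: Schema, file: File): Unit = {
>   // Even for a single record, the writer emits the full footer plus
>   // row-group and page statistics, which is largely where the ~2 KB
>   // file size comes from.
>   val writer = AvroParquetWriter
>     .builder[T](new Path(file.getAbsolutePath))
>     .withSchema(schema)
>     .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
>     .build()
>   writer.write(record)
>   writer.close()
> }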
>
>
>
> Cheers,
>
> Arun
>
>