Thanks a lot, Wes. That makes it very clear.

Just to convince myself, I also ran parquet-tools dump for the single
column "user" and cross-checked the six instances between the binary and
the dump. For each value, the first occurrence is the row-group-level
stats, the second is the page-level stats, and the last is the pair of
values themselves in the data page.

The data used was:

val tweet1 = TweetAvro
  .newBuilder
  .setTarget(1)
  .setId(123)
  .setDate("Saturday 8th, June")
  .setUser("nus1")
  .setText("Parquet tweet1")
  .build()

val tweet2 = TweetAvro
  .newBuilder
  .setTarget(2)
  .setId(234)
  .setDate("Sunday 9th, June")
  .setUser("nus2")
  .setText("Parquet tweet2")
  .build()


$ parquet-tools dump -c user -n serialized_parquet_file.parquet

row group 0
--------------------------------------------------------------------------------
user:  BINARY UNCOMPRESSED DO:0 FPO:235 SZ:49/49/1.00 VC:2 ENC:PLAIN,BIT_PACKED ST:[min: nus1, max: nus2, num_nulls: 0]

    user TV=2 RL=0 DL=0
    ----------------------------------------------------------------------------
    page 0:  DLE:BIT_PACKED RLE:BIT_PACKED VLE:PLAIN ST:[min: nus1, max: nus2, num_nulls: 0] SZ:16 VC:2

BINARY user
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 2 ***
value 1: R:0 D:0 V:nus1
value 2: R:0 D:0 V:nus2
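The accounting behind those six instances can be sanity-checked with a quick back-of-the-envelope sketch. This is my own illustration, not Parquet API code; the three storage locations it counts are the ones visible in the dump above (data page, page-header stats, column-chunk stats in the footer):

```scala
// Back-of-the-envelope check of the six instances seen in the dump.
// Assumption (from the dump above): with an uncompressed, PLAIN-encoded
// column and statistics enabled, a value that is the min or max of its
// page AND of its row group is written three times:
//   1. in the data page itself,
//   2. in the page-header statistics,
//   3. in the column-chunk (row-group) statistics in the footer.
def copiesInFile(isPageMinOrMax: Boolean, isChunkMinOrMax: Boolean): Int = {
  val inDataPage  = 1
  val inPageStats = if (isPageMinOrMax) 1 else 0
  val inFooter    = if (isChunkMinOrMax) 1 else 0
  inDataPage + inPageStats + inFooter
}

// With only two values, each one is both the page and the chunk min/max,
// so each appears three times: the six instances cross-checked above.
val copies = Seq("nus1", "nus2").map(_ => copiesInFile(true, true)).sum
println(copies)
```

With larger files the per-value multiplier effectively disappears, since almost no values are a page or chunk min/max; it is only pathological for tiny files like this two-record example.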



Thanks a ton, again.

Cheers,
Arun



On Tue, Feb 26, 2019 at 10:14 PM Arun Manivannan <[email protected]> wrote:

> Hi,
>
> Apologies in advance if this is a silly question.
>
> I was trying to compare the various data formats as an exercise and I
> noticed that the size of a Parquet output file is huge (2 KB) compared to
> Thrift (79 bytes), PB (45 bytes) or Avro (333 bytes). I thought the issue
> was with Spark, but then I used a ThriftParquetWriter to write a single
> record (AvroParquetWriter yielded the same result too; the tiny difference
> in bytes is because of the schema part). The trouble is that I see 5
> instances of the data in the binary (hence the 2 KB size).
>
> Could someone explain or point me to a link that explains why this is the
> case?
>
> This is parquet 1.8.2 and I haven't tried any lower version.
>
> *Binary : *
> [image: image.png]
>
> *Code (Thrift): *
>
> def serialize(t: TweetThrift, file: File) = {
>   val dataFileWriter = new ThriftParquetWriter[TweetThrift](
>     new Path(file.getAbsolutePath),
>     classOf[TweetThrift],
>     CompressionCodecName.UNCOMPRESSED)
>   dataFileWriter.write(t)
>   dataFileWriter.close()
> }
>
> val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
> tweet.setText("First tweet")
> ParquetSerDe.serialize(tweet, file)
>
>
> *Code (Avro): *
>
> val file = new File("serialized_parquet_file.parquet")
> val tweet = TweetAvro
>   .newBuilder
>   .setTarget(1)
>   .setId(123)
>   .setDate("Saturday 8th, June")
>   .setUser("arunma")
>   .setText("Parquet tweet")
>   .build()
>
> ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
>
>
>
> Cheers,
>
> Arun
>
>
