hi Arun, Parquet isn't designed for efficient transport of small bits of data, e.g. 1 record at a time. It's designed to compactly store large analytics datasets, where dictionary encoding, run-length encoding, and compression are effective at reducing space.
In your example, there are several additional pieces of data stored:

* File footer metadata
* Row group and column chunk metadata, including "column" statistics (min and max values)
* Data page headers, one for each "column"

If you compared Avro- or Protobuf-based storage of e.g. a 1-million-record dataset as a single file, I would bet that Parquet would be smaller (possibly significantly smaller, 5-10x or more) in typical use cases.

- Wes

On Tue, Feb 26, 2019 at 8:14 AM Arun Manivannan <[email protected]> wrote:

> Hi,
>
> Apologies in advance if this is a silly question.
>
> I was trying to compare the various data formats as an exercise and I
> noticed that the size of a Parquet output file is huge (2 KB) compared to
> Thrift (79 bytes), PB (45 bytes) or Avro (333 bytes). I thought the issue
> was with Spark, and then I used a ThriftParquetWriter to write a single
> record (AvroParquetWriter yielded the same result too; the tiny difference
> in bytes is because of the schema part). The trouble is that I see 5
> instances of the data in the binary (hence the 2 KB size).
>
> Could someone explain or point me to a link that explains why this is the
> case?
>
> This is Parquet 1.8.2 and I haven't tried any lower version.
> *Binary:*
> [image: image.png]
>
> *Code (Thrift):*
>
> def serialize(t: TweetThrift, file: File) = {
>   val dataFileWriter = new ThriftParquetWriter[TweetThrift](
>     new Path(file.getAbsolutePath),
>     classOf[TweetThrift],
>     CompressionCodecName.UNCOMPRESSED)
>   dataFileWriter.write(t)
>   dataFileWriter.close()
> }
>
> val tweet = new TweetThrift(1, 123, "Saturday 8th, June", "arunma")
> tweet.setText("First tweet")
> ParquetSerDe.serialize(tweet, file)
>
> *Code (Avro):*
>
> val file = new File("serialized_parquet_file.parquet")
> val tweet = TweetAvro
>   .newBuilder
>   .setTarget(1)
>   .setId(123)
>   .setDate("Saturday 8th, June")
>   .setUser("arunma")
>   .setText("Parquet tweet")
>   .build()
>
> ParquetSerDe.serialize(tweet, TweetAvro.SCHEMA$, file)
>
> Cheers,
> Arun
