I was trying to write a Parquet file with delta encoding. This page
<https://github.com/apache/parquet-format/blob/master/Encodings.md> states
that Parquet supports three types of delta encoding:

    (DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY).

Since Spark, PySpark, and pyarrow do not let us specify the encoding
method directly, I was curious how one can write a file with delta encoding
enabled.
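The closest control I could find is the Parquet writer format version. My
understanding (which may be wrong) is that parquet-mr only picks the DELTA_*
encodings when the writer is switched to format version 2, and only for values
that are not dictionary-encoded. Below is a minimal sketch of what I would try,
assuming the parquet-mr Hadoop properties parquet.writer.version and
parquet.enable.dictionary are actually forwarded to the writer by Spark:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("delta-encoding-test")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: with the v2 writer and dictionary encoding disabled,
    // parquet-mr should fall back to DELTA_BINARY_PACKED / DELTA_BYTE_ARRAY.
    spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")
    spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")

    Seq(1L, 2L, 3L, 4L, 5L).toDF("value")
      .write.format("parquet").mode("overwrite").save("delta_test")

I am not sure whether this is the intended way to do it from Spark, so
corrections are welcome.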

I also found online the claim that if a column is of Timestamp type, Parquet
will use delta encoding. So I used the following code in *Scala* to create a
Parquet file. But the encoding is not delta.


    import org.apache.spark.sql.functions.col
    import spark.implicits._

    val df = Seq("2018-05-01",
                 "2018-05-02",
                 "2018-05-03",
                 "2018-05-04",
                 "2018-05-05",
                 "2018-05-06",
                 "2018-05-07",
                 "2018-05-08",
                 "2018-05-09",
                 "2018-05-10"
            ).toDF("Id")
    // Derive a timestamp column and a date column from the string column.
    val df2 = df.withColumn("Timestamp", col("Id").cast("timestamp"))
    val df3 = df2.withColumn("Date", col("Id").cast("date"))

    df3.coalesce(1).write.format("parquet").mode("append").save("date_time2")

parquet-tools shows the following information regarding the written parquet
file.

    file schema: spark_schema
    --------------------------------------------------------------------------------
    Id:          OPTIONAL BINARY L:STRING R:0 D:1
    Timestamp:   OPTIONAL INT96 R:0 D:1
    Date:        OPTIONAL INT32 L:DATE R:0 D:1

    row group 1: RC:31 TS:1100 OFFSET:4
    --------------------------------------------------------------------------------
    Id:          BINARY SNAPPY DO:0 FPO:4 SZ:230/487/2.12 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]
    Timestamp:   INT96 SNAPPY DO:0 FPO:234 SZ:212/436/2.06 VC:31 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[num_nulls: 0, min/max not defined]
    Date:        INT32 SNAPPY DO:0 FPO:446 SZ:181/177/0.98 VC:31 ENC:RLE,PLAIN,BIT_PACKED ST:[min: 2018-05-01, max: 2018-05-31, num_nulls: 0]

As you can see, no column has used delta encoding.

My questions are:

1) How can I write a Parquet file with delta encoding? (Example code in
Scala or Python would be great.)

2) How do I decide which delta encoding (DELTA_BINARY_PACKED,
DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) to use?