fornaix opened a new pull request #32751:
URL: https://github.com/apache/spark/pull/32751


   ### What changes were proposed in this pull request?
   
   This PR aims to support LZ4 compression in the ORC data source.
   
   ### Why are the changes needed?
   
   Apache ORC supports LZ4 compression, but the ORC data source currently rejects `lz4` as a compression codec.
   
   **BEFORE**
   
   ```scala
   scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4")
   java.lang.IllegalArgumentException: Codec [lz4] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none, zstd.
   ```
   
   **AFTER**
   
   ```scala
   scala> spark.range(10).write.option("compression", "lz4").orc("/tmp/lz4")
   ```
   ```bash
   $ orc-tools meta /tmp/lz4
   Processing data file file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc [length: 222]
   Structure for file:/tmp/lz4/part-00000-6a244eee-b092-4c79-a977-fb8a69dde2eb-c000.lz4.orc
   File Version: 0.12 with ORC_517
   Rows: 10
   Compression: LZ4
   Compression size: 262144
   Type: struct<id:bigint>
   
   Stripe Statistics:
     Stripe 1:
       Column 0: count: 10 hasNull: false
       Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45
   
   File Statistics:
     Column 0: count: 10 hasNull: false
     Column 1: count: 10 hasNull: false bytesOnDisk: 7 min: 0 max: 9 sum: 45
   
   Stripes:
     Stripe: offset: 3 data: 7 rows: 10 tail: 35 index: 35
       Stream: column 0 section ROW_INDEX start: 3 length 11
       Stream: column 1 section ROW_INDEX start: 14 length 24
       Stream: column 1 section DATA start: 38 length 7
       Encoding column 0: DIRECT
       Encoding column 1: DIRECT_V2
   
   File length: 222 bytes
   Padding length: 0 bytes
   Padding ratio: 0%
   
   User Metadata:
     org.apache.spark.version=3.2.0
   ```
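
   As a complement to the `orc-tools` output above, a round-trip read can confirm that the LZ4-compressed file is readable from Spark itself. This is only a minimal sketch, assuming a build that includes this change and a local session; the output path `/tmp/lz4-roundtrip` and the application name are hypothetical and not part of the PR.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.functions.sum

   // Reuses the existing session in spark-shell, or creates a local one otherwise.
   val spark = SparkSession.builder()
     .master("local[1]")
     .appName("orc-lz4-sketch") // hypothetical name, for illustration only
     .getOrCreate()

   // Hypothetical scratch location; any writable path works.
   val path = "/tmp/lz4-roundtrip"

   // Write ten rows (id = 0..9) with the codec this PR enables.
   spark.range(10)
     .write
     .mode("overwrite")
     .option("compression", "lz4")
     .orc(path)

   // Read the data back and compare against the written values.
   val restored = spark.read.orc(path)
   val rowCount = restored.count()
   val idSum = restored.agg(sum("id")).first().getLong(0)

   // Same row count and sum as the statistics reported by orc-tools above.
   assert(rowCount == 10L)
   assert(idSum == 45L)
   ```

   On a build without this change, the `write` call is expected to fail with the `IllegalArgumentException` shown in the BEFORE section.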
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. Users can now set the `compression` option to `lz4` when writing with the ORC data source.
   
   ### How was this patch tested?
   
   Pass the newly added test case.




