GitHub user MaxGekk opened a pull request:
https://github.com/apache/spark/pull/21902
[SPARK-24952][SQL] Support LZMA2 compression by Avro datasource
## What changes were proposed in this pull request?
In this PR, I propose supporting `LZMA2` (`XZ`) and `BZIP2` compression in the
`AVRO` datasource on write, since these codecs achieve a much better compression
ratio than the already supported `deflate` and `snappy` codecs. To tune the
compression level of `XZ`, the PR introduces a new SQL config,
`spark.sql.avro.xz.level`, with a default value of `6`. The allowed range of
levels is `[0, 9]`.
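As a rough illustration of the compression-ratio argument, here is a minimal sketch that compares the three codec families using Python's standard library rather than Spark or Avro. The helper `compressed_sizes` and the sample payload are hypothetical, and actual ratios depend heavily on the data. It also mirrors the `[0, 9]` range check described for the `XZ` level.

```python
import bz2
import lzma
import zlib

# Level range and default taken from the PR description.
XZ_MIN_LEVEL, XZ_MAX_LEVEL = 0, 9


def compressed_sizes(data: bytes, xz_level: int = 6) -> dict:
    """Return the compressed size in bytes per codec (hypothetical helper)."""
    if not XZ_MIN_LEVEL <= xz_level <= XZ_MAX_LEVEL:
        raise ValueError(f"xz level must be in [0, 9], got {xz_level}")
    return {
        "deflate": len(zlib.compress(data)),
        "bzip2": len(bz2.compress(data)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ, preset=xz_level)),
    }


# Assumed sample payload: repetitive text rows, not real Avro blocks.
sample = b"".join(b"row-%d,value-%d\n" % (i, i % 17) for i in range(20000))
sizes = compressed_sizes(sample)
for codec, size in sizes.items():
    print(f"{codec}: {size} bytes ({size / len(sample):.1%} of original)")
```

Higher `XZ` levels generally trade more CPU time and memory for smaller output, which is why a tunable level can be useful.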
## How was this patch tested?
It was tested manually and by an existing test which was extended to check
the `xz` and `bzip2` compressions.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MaxGekk/spark-1 avro-xz-bzip2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21902.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21902
----
commit e3b8856c6f8769cf1c2646e7cf5ae41fb3c8d626
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-27T20:15:04Z
Support bzip2
commit 7b9dd253e313fb7b5f674672f8bd5447812522a3
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-27T20:40:18Z
Support xz
commit d4dbeb10656283d957c9c52327da97170f9ad080
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-27T21:12:54Z
Refactoring
commit 3e1139af293cb2e06e125edfd443a5b5a0265b84
Author: Maxim Gekk <maxim.gekk@...>
Date: 2018-07-27T21:30:30Z
Fix comments
----