This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new a9bbf38 ARROW-13712: Reading and Writing Compressed Data (#87)
a9bbf38 is described below
commit a9bbf385acb2637c8b42c9b9f141922a7936c2e6
Author: Alessandro Molina <[email protected]>
AuthorDate: Thu Oct 28 11:22:05 2021 +0200
ARROW-13712: Reading and Writing Compressed Data (#87)
* ARROW-13712: Reading and Writing Compressed Data
* Apply suggestions from code review
Co-authored-by: Nic <[email protected]>
* Rewording
* Rewording
* Update python/source/io.rst
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Nic <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
---
python/source/io.rst | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 119 insertions(+), 1 deletion(-)
diff --git a/python/source/io.rst b/python/source/io.rst
index db03d74..1071be5 100755
--- a/python/source/io.rst
+++ b/python/source/io.rst
@@ -577,4 +577,122 @@ The content of the file can be read back to a
:class:`pyarrow.Table` using
.. testoutput::
- {'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
\ No newline at end of file
+ {'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
+
+Writing Compressed Data
+=======================
+
+Arrow provides support for writing files in compressed formats,
+both for formats that provide compression natively like Parquet or Feather,
+and for formats that don't support compression out of the box like CSV.
+
+Given a table:
+
+.. testcode::
+
+ table = pa.table([
+ pa.array([1, 2, 3, 4, 5])
+ ], names=["numbers"])
+
+Writing compressed Parquet or Feather data is driven by the
+``compression`` argument to the :func:`pyarrow.feather.write_feather` and
+:func:`pyarrow.parquet.write_table` functions:
+
+.. testcode::
+
+ pa.feather.write_feather(table, "compressed.feather",
+ compression="lz4")
+ pa.parquet.write_table(table, "compressed.parquet",
+ compression="lz4")
+
+You can refer to each of those functions' documentation for a complete
+list of supported compression formats.
+
+.. note::
+
+ Arrow uses compression by default when writing Parquet or
+ Feather files: Feather defaults to ``lz4`` and Parquet
+ defaults to ``snappy``.
+
+For formats that don't support compression natively, like CSV,
+it's possible to save compressed data using
+:class:`pyarrow.CompressedOutputStream`:
+
+.. testcode::
+
+ with pa.CompressedOutputStream("compressed.csv.gz", "gzip") as out:
+ pa.csv.write_csv(table, out)
+
+This requires decompressing the file when reading it back,
+which can be done using :class:`pyarrow.CompressedInputStream`
+as explained in the next recipe.
+
+Reading Compressed Data
+=======================
+
+Arrow provides support for reading compressed files,
+both for formats that provide it natively like Parquet or Feather,
+and for files in formats that don't support compression natively,
+like CSV, but have been compressed by an application.
+
+Reading compressed formats that have native support for compression
+doesn't require any special handling. We can for example read back
+the Parquet and Feather files we wrote in the previous recipe
+by simply invoking :func:`pyarrow.feather.read_table` and
+:func:`pyarrow.parquet.read_table`:
+
+.. testcode::
+
+ table_feather = pa.feather.read_table("compressed.feather")
+ print(table_feather)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64
+
+.. testcode::
+
+ table_parquet = pa.parquet.read_table("compressed.parquet")
+ print(table_parquet)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64
+
+Reading data from formats that don't support compression natively
+requires decompressing the data before decoding it.
+This can be done with the :class:`pyarrow.CompressedInputStream` class,
+which wraps a file with a decompression step before the result is
+provided to the actual read function.
+
+For example to read a compressed CSV file:
+
+.. testcode::
+
+ with pa.CompressedInputStream(pa.OSFile("compressed.csv.gz"), "gzip") as input:
+ table_csv = pa.csv.read_csv(input)
+ print(table_csv)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64
+
+.. note::
+
+ In the case of CSV, Arrow will detect compressed files by their
+ file extension. So if your file is named ``*.gz`` or ``*.bz2``,
+ the :func:`pyarrow.csv.read_csv` function will try to decompress
+ it accordingly
+
+.. testcode::
+
+ table_csv2 = pa.csv.read_csv("compressed.csv.gz")
+ print(table_csv2)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64