This is an automated email from the ASF dual-hosted git repository.
thisisnic pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new a9bbf38 ARROW-13712: Reading and Writing Compressed Data (#87)
a9bbf38 is described below
commit a9bbf385acb2637c8b42c9b9f141922a7936c2e6
Author: Alessandro Molina <[email protected]>
AuthorDate: Thu Oct 28 11:22:05 2021 +0200
ARROW-13712: Reading and Writing Compressed Data (#87)
* ARROW-13712: Reading and Writing Compressed Data
* Apply suggestions from code review
Co-authored-by: Nic <[email protected]>
* Rewording
* Rewording
* Update python/source/io.rst
Co-authored-by: Weston Pace <[email protected]>
Co-authored-by: Nic <[email protected]>
Co-authored-by: Weston Pace <[email protected]>
---
python/source/io.rst | 120 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 119 insertions(+), 1 deletion(-)
diff --git a/python/source/io.rst b/python/source/io.rst
index db03d74..1071be5 100755
--- a/python/source/io.rst
+++ b/python/source/io.rst
@@ -577,4 +577,122 @@ The content of the file can be read back to a
:class:`pyarrow.Table` using
.. testoutput::
- {'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
\ No newline at end of file
+ {'a': [1, 3, 5, 7], 'b': [2.0, 3.0, 4.0, 5.0], 'c': [1, 2, 3, 4]}
+
+Writing Compressed Data
+=======================
+
+Arrow provides support for writing files in compressed formats,
+both for formats that provide compression natively like Parquet or Feather,
+and for formats that don't support compression out of the box like CSV.
+
+Given a table:
+
+.. testcode::
+
+ table = pa.table([
+ pa.array([1, 2, 3, 4, 5])
+ ], names=["numbers"])
+
+Writing compressed Parquet or Feather data is driven by the
+``compression`` argument to the :func:`pyarrow.feather.write_feather` and
+:func:`pyarrow.parquet.write_table` functions:
+
+.. testcode::
+
+ pa.feather.write_feather(table, "compressed.feather",
+ compression="lz4")
+ pa.parquet.write_table(table, "compressed.parquet",
+ compression="lz4")
+
+You can refer to each of those functions' documentation for a complete
+list of supported compression formats.
+
+.. note::
+
+ Arrow uses compression by default when writing Parquet or
+ Feather files: Feather defaults to ``lz4`` and Parquet
+ defaults to ``snappy``.
+
+For formats that don't support compression natively, like CSV,
+it's possible to save compressed data using
+:class:`pyarrow.CompressedOutputStream`:
+
+.. testcode::
+
+ with pa.CompressedOutputStream("compressed.csv.gz", "gzip") as out:
+ pa.csv.write_csv(table, out)
+
+This requires decompressing the file when reading it back,
+which can be done using :class:`pyarrow.CompressedInputStream`
+as explained in the next recipe.
+
+Reading Compressed Data
+=======================
+
+Arrow provides support for reading compressed files,
+both for formats that provide it natively like Parquet or Feather,
+and for files in formats that don't support compression natively,
+like CSV, but have been compressed by an application.
+
+Reading compressed formats that have native support for compression
+doesn't require any special handling. We can for example read back
+the Parquet and Feather files we wrote in the previous recipe
+by simply invoking :func:`pyarrow.feather.read_table` and
+:func:`pyarrow.parquet.read_table`:
+
+.. testcode::
+
+ table_feather = pa.feather.read_table("compressed.feather")
+ print(table_feather)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64
+
+.. testcode::
+
+ table_parquet = pa.parquet.read_table("compressed.parquet")
+ print(table_parquet)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64
+
+Reading data from formats that don't support compression natively
+requires decompressing the data before decoding it.
+This can be done with the :class:`pyarrow.CompressedInputStream` class,
+which wraps a file with a decompression step before the result is
+provided to the actual read function.
+
+For example to read a compressed CSV file:
+
+.. testcode::
+
+ with pa.CompressedInputStream(pa.OSFile("compressed.csv.gz"), "gzip") as input:
+ table_csv = pa.csv.read_csv(input)
+ print(table_csv)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64
+
+.. note::
+
+ In the case of CSV, Arrow will detect compressed files by their
+ file extension. So if your file is named ``*.gz`` or ``*.bz2``,
+ the :func:`pyarrow.csv.read_csv` function will try to decompress
+ it accordingly
+
+.. testcode::
+
+ table_csv2 = pa.csv.read_csv("compressed.csv.gz")
+ print(table_csv2)
+
+.. testoutput::
+
+ pyarrow.Table
+ numbers: int64