This is an automated email from the ASF dual-hosted git repository.
fokko pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-python.git
The following commit(s) were added to refs/heads/main by this push:
new adabf09 Docs: Add section on pandas (#138)
adabf09 is described below
commit adabf09feff440116de3c507e4317336c1ceacc6
Author: Fokko Driesprong <[email protected]>
AuthorDate: Tue Nov 14 18:24:37 2023 +0100
Docs: Add section on pandas (#138)
* Docs: Add section on pandas
* Update mkdocs/docs/api.md
---
mkdocs/docs/api.md | 41 ++++++++++++++++++++++++++++++++++++++++-
1 file changed, 40 insertions(+), 1 deletion(-)
diff --git a/mkdocs/docs/api.md b/mkdocs/docs/api.md
index d716a13..e2f726a 100644
--- a/mkdocs/docs/api.md
+++ b/mkdocs/docs/api.md
@@ -318,7 +318,7 @@ In this case it is up to the engine itself to filter the file itself. Below, `to
<!-- prettier-ignore-start -->
!!! note "Requirements"
- This requires [PyArrow to be installed](index.md).
+ This requires [`pyarrow` to be installed](index.md).
<!-- prettier-ignore-end -->
@@ -346,6 +346,45 @@ tpep_dropoff_datetime: [[2021-04-01 00:47:59.000000,...,2021-05-01 00:14:47.0000
This will only pull in the files that might contain matching rows.
+### Pandas
+
+<!-- prettier-ignore-start -->
+
+!!! note "Requirements"
+ This requires [`pandas` to be installed](index.md).
+
+<!-- prettier-ignore-end -->
+
+PyIceberg makes it easy to filter out data from a huge table and pull it into a Pandas dataframe locally. This will only fetch the relevant Parquet files for the query and apply the filter. This will reduce IO and therefore improve performance and reduce cost.
+
+```python
+table.scan(
+ row_filter="trip_distance >= 10.0",
+    selected_fields=("VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime"),
+).to_pandas()
+```
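As a hedged sketch (not part of the committed docs): once the scan is materialized, any further transformation is plain pandas. The dataframe below is a hypothetical stand-in whose columns mirror the example output above.

```python
import pandas as pd

# Hypothetical rows shaped like the scan result above.
df = pd.DataFrame({
    "VendorID": [2, 1, 2],
    "tpep_pickup_datetime": pd.to_datetime(
        ["2021-04-01 00:28:05", "2021-04-01 00:39:01", "2021-04-01 00:14:42"], utc=True
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2021-04-01 00:47:59", "2021-04-01 00:57:39", "2021-04-01 00:42:59"], utc=True
    ),
})

# Derive trip duration and filter locally, after PyIceberg has already
# pruned files and rows server-side via the row_filter.
df["duration"] = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
long_trips = df[df["duration"] > pd.Timedelta(minutes=20)]
```

The split of work here is deliberate: coarse predicates go into `row_filter` so PyIceberg can skip Parquet files, while fine-grained, derived-column logic stays in pandas.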
+
+This will return a Pandas dataframe:
+
+```
+ VendorID tpep_pickup_datetime tpep_dropoff_datetime
+0 2 2021-04-01 00:28:05+00:00 2021-04-01 00:47:59+00:00
+1 1 2021-04-01 00:39:01+00:00 2021-04-01 00:57:39+00:00
+2 2 2021-04-01 00:14:42+00:00 2021-04-01 00:42:59+00:00
+3 1 2021-04-01 00:17:17+00:00 2021-04-01 00:43:38+00:00
+4 1 2021-04-01 00:24:04+00:00 2021-04-01 00:56:20+00:00
+... ... ... ...
+116976 2 2021-04-30 23:56:18+00:00 2021-05-01 00:29:13+00:00
+116977 2 2021-04-30 23:07:41+00:00 2021-04-30 23:37:18+00:00
+116978 2 2021-04-30 23:38:28+00:00 2021-05-01 00:12:04+00:00
+116979 2 2021-04-30 23:33:00+00:00 2021-04-30 23:59:00+00:00
+116980 2 2021-04-30 23:44:25+00:00 2021-05-01 00:14:47+00:00
+
+[116981 rows x 3 columns]
+```
+
+It is recommended to use Pandas 2 or later, because it stores the data in an [Apache Arrow backend](https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i) which avoids copies of data.
+
### DuckDB
<!-- prettier-ignore-start -->