This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-ballista.git
The following commit(s) were added to refs/heads/master by this push:
new a465c0db Add basic Python docs and enable information_schema in Python
context (#170)
a465c0db is described below
commit a465c0dbfd91fd09ee9ac7acf4db091eb0355902
Author: Andy Grove <[email protected]>
AuthorDate: Mon Aug 29 20:38:02 2022 -0600
Add basic Python docs and enable information_schema in Python context (#170)
---
docs/source/user-guide/python.md | 72 ++++++++++++++++++++++++++++++++++++----
python/src/ballista_context.rs | 1 +
2 files changed, 66 insertions(+), 7 deletions(-)
diff --git a/docs/source/user-guide/python.md b/docs/source/user-guide/python.md
index 0d42c450..3bd4fe50 100644
--- a/docs/source/user-guide/python.md
+++ b/docs/source/user-guide/python.md
@@ -17,16 +17,74 @@
under the License.
-->
-# Python
+# Ballista Python Bindings
+
+Ballista provides Python bindings, allowing SQL and DataFrame queries to be executed from the Python shell.
+
+## Connecting to a Cluster
+
+The following code demonstrates how to create a Ballista context and connect to a scheduler.
```text
>>> import ballista
>>> ctx = ballista.BallistaContext("localhost", 50050)
->>> df = ctx.sql("SELECT 1")
+```
+
+## Registering Tables
+
+Tables can be registered against the context by calling one of the `register` methods, or by executing SQL.
+
+```text
+>>> ctx.register_parquet("trips", "/mnt/bigdata/nyctaxi")
+```
+
+```text
+>>> ctx.sql("CREATE EXTERNAL TABLE trips STORED AS PARQUET LOCATION '/mnt/bigdata/nyctaxi'")
+```
+
+## Executing Queries
+
+The `sql` method creates a `DataFrame`. The query is executed when an action such as `show` or `collect` is called.
+
+### Showing Query Results
+
+```text
+>>> df = ctx.sql("SELECT count(*) FROM trips")
>>> df.show()
-+----------+
-| Int64(1) |
-+----------+
-| 1 |
-+----------+
++-----------------+
+| COUNT(UInt8(1)) |
++-----------------+
+| 9071244 |
++-----------------+
+```
+
+### Collecting Query Results
+
+The `collect` method executes the query and returns the results in
+[PyArrow](https://arrow.apache.org/docs/python/index.html) record batches.
+
+```text
+>>> df = ctx.sql("SELECT count(*) FROM trips")
+>>> df.collect()
+[pyarrow.RecordBatch
+COUNT(UInt8(1)): int64]
+```
+
+### Viewing Query Plans
+
+The `explain` method can be used to show the logical and physical query plans for a query.
+
+```text
+>>> df.explain()
++---------------+-------------------------------------------------------------+
+| plan_type | plan |
++---------------+-------------------------------------------------------------+
+| logical_plan | Projection: #COUNT(UInt8(1)) |
+| | Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]] |
+| | TableScan: trips projection=[VendorID] |
+| physical_plan | ProjectionExec: expr=[COUNT(UInt8(1))@0 as COUNT(UInt8(1))] |
+| | ProjectionExec: expr=[9071244 as COUNT(UInt8(1))] |
+| | EmptyExec: produce_one_row=true |
+| | |
++---------------+-------------------------------------------------------------+
```
diff --git a/python/src/ballista_context.rs b/python/src/ballista_context.rs
index 4fd91c62..40e389e7 100644
--- a/python/src/ballista_context.rs
+++ b/python/src/ballista_context.rs
@@ -42,6 +42,7 @@ impl PyBallistaContext {
fn new(py: Python, host: &str, port: u16) -> PyResult<Self> {
let config = BallistaConfig::builder()
.set("ballista.shuffle.partitions", "4")
+ .set("ballista.with_information_schema", "true")
.build()
.map_err(BallistaError::from)?;
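
With `ballista.with_information_schema` set to `true`, table metadata should become queryable through SQL from the Python context. The following is a minimal sketch and is not part of this commit: `list_tables` is a hypothetical helper, and it assumes that `SHOW TABLES` (which DataFusion accepts only when information_schema is enabled) is handled by `ctx.sql` like any other statement.

```python
def list_tables(ctx):
    """Return a DataFrame describing the tables registered on ``ctx``.

    ``ctx`` is expected to be a ``ballista.BallistaContext``; any object
    with a compatible ``sql`` method works. This is a hypothetical helper
    for illustration, not part of the Ballista API.
    """
    # SHOW TABLES is rejected unless information_schema is enabled,
    # which this commit turns on for contexts created from Python.
    return ctx.sql("SHOW TABLES")
```

Given a context created as in the docs above (`ctx = ballista.BallistaContext("localhost", 50050)`), calling `list_tables(ctx).show()` would be expected to list the registered tables, including `trips` once it has been registered.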