This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
     new 1650c98bd5 [doc] Modify Python API to JVM free (#6242)
1650c98bd5 is described below

commit 1650c98bd52c7a62ab75ac65b1054ad78650b2f5
Author: umi <55790489+discivig...@users.noreply.github.com>
AuthorDate: Thu Sep 11 22:26:53 2025 +0800

    [doc] Modify Python API to JVM free (#6242)
---
 docs/content/program-api/python-api.md        | 229 +++++++--------------
 .../pypaimon/tests/reader_append_only_test.py |   2 +-
 2 files changed, 79 insertions(+), 152 deletions(-)
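At a glance, the patch below drops the JRE/JVM/Hadoop classpath setup sections and documents a pure-Python client: the catalog now comes from `CatalogFactory` rather than the py4j-backed `Catalog`. A minimal sketch of that usage, using only the calls shown in the diff (the warehouse path and table name are placeholders):

```python
from pypaimon.catalog.catalog_factory import CatalogFactory

# Keys and values are plain strings; no JVM or Hadoop classpath needs to be configured.
catalog_options = {
    'metastore': 'filesystem',
    'warehouse': 'file:///path/to/warehouse'  # placeholder location
}
catalog = CatalogFactory.create(catalog_options)
table = catalog.get_table('database_name.table_name')  # the table must already exist
```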
diff --git a/docs/content/program-api/python-api.md b/docs/content/program-api/python-api.md
index 691bcd5cdf..90779ed266 100644
--- a/docs/content/program-api/python-api.md
+++ b/docs/content/program-api/python-api.md
@@ -3,8 +3,9 @@ title: "Python API"
 weight: 5
 type: docs
 aliases:
-- /api/python-api.html
+  - /api/python-api.html
 ---
+
 <!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file
@@ -26,99 +27,31 @@ under the License.
 
 # Java-based Implementation For Python API
 
-[Python SDK ](https://github.com/apache/paimon-python) has defined Python API for Paimon. Currently, there is only a Java-based implementation.
-
-Java-based implementation will launch a JVM and use `py4j` to execute Java code to read and write Paimon table.
+[Python SDK ](https://github.com/apache/paimon-python) has defined Python API for Paimon.
 
 ## Environment Settings
 
 ### SDK Installing
 
 SDK is published at [pypaimon](https://pypi.org/project/pypaimon/). You can install by
 
-```shell
-pip install pypaimon
-```
-
-### Java Runtime Environment
-
-This SDK needs JRE 1.8. After installing JRE, make sure that at least one of the following conditions is met:
-1. `java` command is available. You can verify it by `java -version`.
-2. `JAVA_HOME` and `PATH` variables are set correctly. For example, you can set:
-
 ```shell
-export JAVA_HOME=/path/to/java-directory
-export PATH=$JAVA_HOME/bin:$PATH
-```
-
-### Set Environment Variables
-
-Because we need to launch a JVM to access Java code, JVM environment need to be set. Besides, the java code need Hadoop
-dependencies, so hadoop environment should be set.
-
-#### Java classpath
-
-The package has set necessary paimon core dependencies (Local/Hadoop FileIO, Avro/Orc/Parquet format support and
-FileSystem/Jdbc/Hive catalog), so If you just test codes in local or in hadoop environment, you don't need to set classpath.
-
-If you need other dependencies such as OSS/S3 filesystem jars, or special format and catalog ,please prepare jars and set
-classpath via one of the following ways:
-
-1. Set system environment variable: ```export _PYPAIMON_JAVA_CLASSPATH=/path/to/jars/*```
-2. Set environment variable in Python code:
-
-```python
-import os
-from pypaimon import constants
-
-os.environ[constants.PYPAIMON_JAVA_CLASSPATH] = '/path/to/jars/*'
-```
-
-#### JVM args (optional)
-
-You can set JVM args via one of the following ways:
-
-1. Set system environment variable: ```export _PYPAIMON_JVM_ARGS='arg1 arg2 ...'```
-2. Set environment variable in Python code:
-
-```python
-import os
-from pypaimon import constants
-
-os.environ[constants.PYPAIMON_JVM_ARGS] = 'arg1 arg2 ...'
-```
-
-#### Hadoop classpath
-
-If the machine is in a hadoop environment, please ensure the value of the environment variable HADOOP_CLASSPATH include
-path to the common Hadoop libraries, then you don't need to set hadoop.
-
-Otherwise, you should set hadoop classpath via one of the following ways:
-
-1. Set system environment variable: ```export _PYPAIMON_HADOOP_CLASSPATH=/path/to/jars/*```
-2. Set environment variable in Python code:
-
-```python
-import os
-from pypaimon import constants
-
-os.environ[constants.PYPAIMON_HADOOP_CLASSPATH] = '/path/to/jars/*'
+pip install pypaimon
 ```
 
-If you just want to test codes in local, we recommend to use [Flink Pre-bundled hadoop jar](https://flink.apache.org/downloads/#additional-components).
-
-
 ## Create Catalog
 
 Before coming into contact with the Table, you need to create a Catalog.
 
 ```python
-from pypaimon import Catalog
+from pypaimon.catalog.catalog_factory import CatalogFactory
 
 # Note that keys and values are all string
 catalog_options = {
     'metastore': 'filesystem',
     'warehouse': 'file:///path/to/warehouse'
 }
-catalog = Catalog.create(catalog_options)
+catalog = CatalogFactory.create(catalog_options)
 ```
 
 ## Create Database & Table
@@ -126,19 +59,20 @@ catalog = Catalog.create(catalog_options)
 You can use the catalog to create table for writing data.
 
 ### Create Database (optional)
+
 Table is located in a database. If you want to create table in a new database, you should create it.
 
 ```python
 catalog.create_database(
-    name='database_name',
-    ignore_if_exists=True, # If you want to raise error if the database exists, set False
-    properties={'key': 'value'} # optional database properties
+    name='database_name',
+    ignore_if_exists=True,  # If you want to raise error if the database exists, set False
+    properties={'key': 'value'}  # optional database properties
 )
 ```
 
 ### Create Schema
 
-Table schema contains fields definition, partition keys, primary keys, table options and comment. 
+Table schema contains fields definition, partition keys, primary keys, table options and comment.
 The field definition is described by `pyarrow.Schema`. All arguments except fields definition are optional.
 
 Generally, there are two ways to build `pyarrow.Schema`.
@@ -157,16 +91,15 @@ pa_schema = pa.schema([
     ('value', pa.string())
 ])
 
-schema = Schema(
-    pa_schema=pa_schema,
+schema = Schema.from_pyarrow_schema(
+    pa_schema=pa_schema,
     partition_keys=['dt', 'hh'],
     primary_keys=['dt', 'hh', 'pk'],
     options={'bucket': '2'},
-    comment='my test table'
-)
+    comment='my test table')
 ```
 
-See [Data Types]({{< ref "python-api#data-types" >}}) for all supported `pyarrow-to-paimon` data types mapping. 
+See [Data Types]({{< ref "python-api#data-types" >}}) for all supported `pyarrow-to-paimon` data types mapping.
 
 Second, if you have some Pandas data, the `pa_schema` can be extracted from `DataFrame`:
 
@@ -187,9 +120,9 @@ dataframe = pd.DataFrame(data)
 # Get Paimon Schema
 record_batch = pa.RecordBatch.from_pandas(dataframe)
-schema = Schema(
-    pa_schema=record_batch.schema,
-    partition_keys=['dt', 'hh'],
+schema = Schema.from_pyarrow_schema(
+    pa_schema=record_batch.schema,
+    partition_keys=['dt', 'hh'],
     primary_keys=['dt', 'hh', 'pk'],
     options={'bucket': '2'},
     comment='my test table'
 )
@@ -204,8 +137,8 @@ After building table schema, you can create corresponding table:
 schema = ...
 catalog.create_table(
     identifier='database_name.table_name',
-    schema=schema,
-    ignore_if_exists=True # If you want to raise error if the table exists, set False
+    schema=schema,
+    ignore_if_exists=True  # If you want to raise error if the table exists, set False
 )
 ```
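Putting the schema and table hunks above together, a compact sketch of the documented flow. The `pyarrow` field types for `dt`/`hh`/`pk` and the `from pypaimon import Schema` import are assumptions (the patch only shows the class name and the `('value', pa.string())` field); database and table names are placeholders.

```python
import pyarrow as pa

from pypaimon import Schema  # import path assumed; the diff shows only the class name
from pypaimon.catalog.catalog_factory import CatalogFactory

catalog = CatalogFactory.create({'metastore': 'filesystem', 'warehouse': 'file:///path/to/warehouse'})

pa_schema = pa.schema([
    ('dt', pa.string()),    # assumed type
    ('hh', pa.string()),    # assumed type
    ('pk', pa.int64()),     # assumed type
    ('value', pa.string())
])

schema = Schema.from_pyarrow_schema(
    pa_schema=pa_schema,
    partition_keys=['dt', 'hh'],
    primary_keys=['dt', 'hh', 'pk'],
    options={'bucket': '2'},
    comment='my test table')

catalog.create_database(name='my_db', ignore_if_exists=True)
catalog.create_table(identifier='my_db.my_table', schema=schema, ignore_if_exists=True)
table = catalog.get_table('my_db.my_table')
```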
@@ -217,13 +150,56 @@ The Table interface provides tools to read and write table.
 table = catalog.get_table('database_name.table_name')
 ```
 
-## Batch Read
+## Batch Write
+
+Paimon table write is Two-Phase Commit, you can write many times, but once committed, no more data can be written.
+
+{{< hint warning >}}
+Currently, the feature of writing multiple times and committing once only supports append only table.
+{{< /hint >}}
+
+```python
+table = catalog.get_table('database_name.table_name')
+
+# 1. Create table write and commit
+write_builder = table.new_batch_write_builder()
+table_write = write_builder.new_write()
+table_commit = write_builder.new_commit()
+
+# 2. Write data. Support 3 methods:
+# 2.1 Write pandas.DataFrame
+dataframe = ...
+table_write.write_pandas(dataframe)
+
+# 2.2 Write pyarrow.Table
+pa_table = ...
+table_write.write_arrow(pa_table)
+
+# 2.3 Write pyarrow.RecordBatch
+record_batch = ...
+table_write.write_arrow_batch(record_batch)
+
+# 3. Commit data
+commit_messages = table_write.prepare_commit()
+table_commit.commit(commit_messages)
 
-### Set Read Parallelism
+# 4. Close resources
+table_write.close()
+table_commit.close()
+```
 
-TableRead interface provides parallelly reading for multiple splits. You can set `'max-workers': 'N'` in `catalog_options`
-to set thread numbers for reading splits. `max-workers` is 1 by default, that means TableRead will read splits sequentially
-if you doesn't set `max-workers`.
+By default, the data will be appended to table. If you want to overwrite table, you should use `TableWrite#overwrite`
+API:
+
+```python
+# overwrite whole table
+write_builder.overwrite()
+
+# overwrite partition 'dt=2024-01-01'
+write_builder.overwrite({'dt': '2024-01-01'})
+```
+
+## Batch Read
 
 ### Get ReadBuilder and Perform pushdown
@@ -296,7 +272,7 @@ You can also read data into a `pyarrow.RecordBatchReader` and iterate record bat
 ```python
 table_read = read_builder.new_read()
 
-for batch in table_read.to_arrow_batch_reader(splits):
+for batch in table_read.to_iterator(splits):
     print(batch)
 
 # pyarrow.RecordBatch
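For orientation, a sketch of a write-then-read round trip built from the calls above. The split-planning step (`new_scan().plan().splits()`), the catalog options and the column names are not part of this patch, so treat them as assumptions taken from the rest of the page.

```python
import pandas as pd

from pypaimon.catalog.catalog_factory import CatalogFactory

catalog = CatalogFactory.create({'metastore': 'filesystem', 'warehouse': 'file:///path/to/warehouse'})
table = catalog.get_table('database_name.table_name')

# Write side: stage data, then commit once (two-phase commit).
write_builder = table.new_batch_write_builder()
table_write = write_builder.new_write()
table_commit = write_builder.new_commit()
table_write.write_pandas(pd.DataFrame({
    'dt': ['2024-01-01'], 'hh': ['00'], 'pk': [1], 'value': ['a']  # columns assumed
}))
table_commit.commit(table_write.prepare_commit())
table_write.close()
table_commit.close()

# Read side: plan splits, then stream pyarrow.RecordBatch objects via to_iterator.
read_builder = table.new_read_builder()
splits = read_builder.new_scan().plan().splits()  # assumed API, documented elsewhere on the page
table_read = read_builder.new_read()
for batch in table_read.to_iterator(splits):
    print(batch.num_rows)
```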
@@ -374,66 +350,17 @@ print(ray_dataset.to_pandas())
 # ...
 ```
 
-## Batch Write
-
-Paimon table write is Two-Phase Commit, you can write many times, but once committed, no more data can be written.
-
-{{< hint warning >}}
-Currently, Python SDK doesn't support writing primary key table with `bucket=-1`.
-{{< /hint >}}
-
-```python
-table = catalog.get_table('database_name.table_name')
-
-# 1. Create table write and commit
-write_builder = table.new_batch_write_builder()
-table_write = write_builder.new_write()
-table_commit = write_builder.new_commit()
-
-# 2. Write data. Support 3 methods:
-# 2.1 Write pandas.DataFrame
-dataframe = ...
-table_write.write_pandas(dataframe)
-
-# 2.2 Write pyarrow.Table
-pa_table = ...
-table_write.write_arrow(pa_table)
-
-# 2.3 Write pyarrow.RecordBatch
-record_batch = ...
-table_write.write_arrow_batch(record_batch)
-
-# 3. Commit data
-commit_messages = table_write.prepare_commit()
-table_commit.commit(commit_messages)
-
-# 4. Close resources
-table_write.close()
-table_commit.close()
-```
-
-By default, the data will be appended to table. If you want to overwrite table, you should use `TableWrite#overwrite`
-API:
-
-```python
-# overwrite whole table
-write_builder.overwrite()
-
-# overwrite partition 'dt=2024-01-01'
-write_builder.overwrite({'dt': '2024-01-01'})
-```
-
 ## Data Types
 
-| pyarrow                                   | Paimon   |
-|:------------------------------------------|:---------|
-| pyarrow.int8()                            | TINYINT  |
-| pyarrow.int16()                           | SMALLINT |
-| pyarrow.int32()                           | INT      |
-| pyarrow.int64()                           | BIGINT   |
-| pyarrow.float16() <br/>pyarrow.float32()  | FLOAT    |
-| pyarrow.float64()                         | DOUBLE   |
-| pyarrow.string()                          | STRING   |
-| pyarrow.boolean()                         | BOOLEAN  |
+| pyarrow                                                           | Paimon   |
+|:------------------------------------------------------------------|:---------|
+| pyarrow.int8()                                                    | TINYINT  |
+| pyarrow.int16()                                                   | SMALLINT |
+| pyarrow.int32()                                                   | INT      |
+| pyarrow.int64()                                                   | BIGINT   |
+| pyarrow.float16() <br/>pyarrow.float32() <br/>pyarrow.float64()   | FLOAT    |
+| pyarrow.string()                                                  | STRING   |
+| pyarrow.boolean()                                                 | BOOLEAN  |
 
 ## Predicate
diff --git a/paimon-python/pypaimon/tests/reader_append_only_test.py b/paimon-python/pypaimon/tests/reader_append_only_test.py
index ca22c79fbe..72b421aa94 100644
--- a/paimon-python/pypaimon/tests/reader_append_only_test.py
+++ b/paimon-python/pypaimon/tests/reader_append_only_test.py
@@ -125,7 +125,7 @@ class AoReaderTest(unittest.TestCase):
         p3 = predicate_builder.between('user_id', 0, 6)  # [2/b, 3/c, 4/d, 5/e, 6/f] left
         p4 = predicate_builder.is_not_in('behavior', ['b', 'e'])  # [3/c, 4/d, 6/f] left
         p5 = predicate_builder.is_in('dt', ['p1'])  # exclude 3/c
-        p6 = predicate_builder.is_not_null('behavior') # exclude 4/d
+        p6 = predicate_builder.is_not_null('behavior')  # exclude 4/d
         g1 = predicate_builder.and_predicates([p1, p2, p3, p4, p5, p6])
         read_builder = table.new_read_builder().with_filter(g1)
         actual = self._read_test_table(read_builder)
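The test change above exercises predicate pushdown; a hedged sketch of the same pattern outside the test harness follows. `new_predicate_builder()`, the split planning and the catalog options are not shown in this patch, so treat those names as assumptions; the column names mirror the test's fixture.

```python
from pypaimon.catalog.catalog_factory import CatalogFactory

catalog = CatalogFactory.create({'metastore': 'filesystem', 'warehouse': 'file:///path/to/warehouse'})
table = catalog.get_table('database_name.table_name')

read_builder = table.new_read_builder()
predicate_builder = read_builder.new_predicate_builder()  # assumed accessor

# Combine several predicates, mirroring the updated test case.
p_between = predicate_builder.between('user_id', 0, 6)
p_not_in = predicate_builder.is_not_in('behavior', ['b', 'e'])
p_in = predicate_builder.is_in('dt', ['p1'])
p_not_null = predicate_builder.is_not_null('behavior')
combined = predicate_builder.and_predicates([p_between, p_not_in, p_in, p_not_null])

read_builder = read_builder.with_filter(combined)
splits = read_builder.new_scan().plan().splits()  # assumed, as in the sketch above
table_read = read_builder.new_read()
for batch in table_read.to_iterator(splits):
    print(batch)
```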