This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git

The following commit(s) were added to refs/heads/master by this push:
     new bfa48045ef [Python] Update Doc for Read Splits and Data Types (#6254)
bfa48045ef is described below

commit bfa48045ef60ef922a673b292464d25506ace276
Author: ChengHui Chen <27797326+chenghuic...@users.noreply.github.com>
AuthorDate: Mon Sep 15 11:59:50 2025 +0800

    [Python] Update Doc for Read Splits and Data Types (#6254)
---
 docs/content/program-api/python-api.md | 53 ++++++++++++++++++++++------------
 1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/docs/content/program-api/python-api.md b/docs/content/program-api/python-api.md
index ab4967a895..a5ab249cc5 100644
--- a/docs/content/program-api/python-api.md
+++ b/docs/content/program-api/python-api.md
@@ -25,9 +25,7 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Java-based Implementation For Python API
-
-[Python SDK ](https://github.com/apache/paimon-python) has defined Python API for Paimon.
+# Python API
 
 ## Environment Settings
 
@@ -65,7 +63,7 @@ Table is located in a database. If you want to create table in a new database, y
 ```python
 catalog.create_database(
     name='database_name',
-    ignore_if_exists=True,  # If you want to raise error if the database exists, set False
+    ignore_if_exists=True,  # To raise error if the database exists, set False
     properties={'key': 'value'}  # optional database properties
 )
 ```
@@ -138,7 +136,7 @@ schema = ...
 catalog.create_table(
     identifier='database_name.table_name',
     schema=schema,
-    ignore_if_exists=True  # If you want to raise error if the table exists, set False
+    ignore_if_exists=True  # To raise error if the table exists, set False
 )
 ```
 
@@ -193,10 +191,10 @@ API:
 
 ```python
 # overwrite whole table
-write_builder.overwrite()
+write_builder = table.new_batch_write_builder().overwrite()
 
 # overwrite partition 'dt=2024-01-01'
-write_builder.overwrite({'dt': '2024-01-01'})
+write_builder = table.new_batch_write_builder().overwrite({'dt': '2024-01-01'})
 ```
 
 ## Batch Read
@@ -272,7 +270,7 @@ You can also read data into a `pyarrow.RecordBatchReader` and iterate record bat
 
 ```python
 table_read = read_builder.new_read()
-for batch in table_read.to_iterator(splits):
+for batch in table_read.to_arrow_batch_reader(splits):
     print(batch)
 
 # pyarrow.RecordBatch
@@ -283,6 +281,19 @@ for batch in table_read.to_iterator(splits):
 # f1: ["a","b","c"]
 ```
 
+#### Python Iterator
+You can read the data row by row into a native Python iterator.
+This is convenient for custom row-based processing logic.
+
+```python
+table_read = read_builder.new_read()
+for row in table_read.to_iterator(splits):
+    print(row)
+
+# [1,2,3]
+# ["a","b","c"]
+```
+
 #### Pandas
 
 This requires `pandas` to be installed.
@@ -351,16 +362,22 @@ print(ray_dataset.to_pandas())
 ```
 
 ## Data Types
-
-| pyarrow | Paimon |
-|:-----------------------------------------------------------------|:---------|
-| pyarrow.int8() | TINYINT |
-| pyarrow.int16() | SMALLINT |
-| pyarrow.int32() | INT |
-| pyarrow.int64() | BIGINT |
-| pyarrow.float16() <br/>pyarrow.float32() <br/>pyarrow.float64() | FLOAT |
-| pyarrow.string() | STRING |
-| pyarrow.boolean() | BOOLEAN |
+| Python Native Type | PyArrow Type | Paimon Type |
+| :--- | :--- | :--- |
+| `int` | `pyarrow.int8()` | `TINYINT` |
+| `int` | `pyarrow.int16()` | `SMALLINT` |
+| `int` | `pyarrow.int32()` | `INT` |
+| `int` | `pyarrow.int64()` | `BIGINT` |
+| `float` | `pyarrow.float32()` | `FLOAT` |
+| `float` | `pyarrow.float64()` | `DOUBLE` |
+| `bool` | `pyarrow.bool_()` | `BOOLEAN` |
+| `str` | `pyarrow.string()` | `STRING`, `CHAR(n)`, `VARCHAR(n)` |
+| `bytes` | `pyarrow.binary()` | `BYTES`, `VARBINARY(n)` |
+| `bytes` | `pyarrow.binary(length)` | `BINARY(length)` |
+| `decimal.Decimal` | `pyarrow.decimal128(precision, scale)` | `DECIMAL(precision, scale)` |
+| `datetime.datetime` | `pyarrow.timestamp(unit, tz=None)` | `TIMESTAMP(p)` |
+| `datetime.date` | `pyarrow.date32()` | `DATE` |
+| `datetime.time` | `pyarrow.time32(unit)` or `pyarrow.time64(unit)` | `TIME(p)` |
 
 ## Predicate
 
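
The Data Types table added in the final hunk can be read as a lookup from a Paimon type to the Python native type a row value is expected to have. The following is a minimal sketch of that lookup using only the standard library. The dictionary `PAIMON_TO_PYTHON` and the helper `python_type_for` are hypothetical names introduced here for illustration; they are not part of the Paimon Python API, and the mapping simply copies the rows of the table above.

```python
import datetime
import decimal
import re

# Paimon base type name -> Python native type, copied from the table in the diff.
# Illustrative only: this dict is an assumption, not an object exposed by Paimon.
PAIMON_TO_PYTHON = {
    "TINYINT": int,
    "SMALLINT": int,
    "INT": int,
    "BIGINT": int,
    "FLOAT": float,
    "DOUBLE": float,
    "BOOLEAN": bool,
    "STRING": str,
    "CHAR": str,
    "VARCHAR": str,
    "BYTES": bytes,
    "VARBINARY": bytes,
    "BINARY": bytes,
    "DECIMAL": decimal.Decimal,
    "TIMESTAMP": datetime.datetime,
    "DATE": datetime.date,
    "TIME": datetime.time,
}


def python_type_for(paimon_type: str) -> type:
    """Return the Python native type for a Paimon type such as 'DECIMAL(10, 2)'.

    Hypothetical helper: parameters in parentheses are ignored, because the
    table maps every parameterisation of a type to the same Python type.
    """
    # Drop a trailing "(...)" parameter list and normalise case.
    base = re.sub(r"\(.*\)", "", paimon_type).strip().upper()
    try:
        return PAIMON_TO_PYTHON[base]
    except KeyError:
        raise ValueError(f"type not covered by the table: {paimon_type!r}") from None
```

For example, `python_type_for("VARCHAR(20)")` returns `str`, and a type that the table does not list, such as `ARRAY<INT>`, raises `ValueError` instead of guessing.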