Re: [PR] [python] Introduce where for Python CLI table read [paimon]

via GitHub Mon, 09 Mar 2026 22:44:01 -0700


Copilot commented on code in PR #7389:
URL: https://github.com/apache/paimon/pull/7389#discussion_r2909476656



##########
docs/content/pypaimon/cli.md:
##########
@@ -94,10 +95,53 @@ paimon table read mydb.users -l 50
 # Read specific columns
 paimon table read mydb.users -s id,name,age
 
-# Combine select and limit
-paimon table read mydb.users -s id,name -l 50
+# Filter with WHERE clause
+paimon table read mydb.users --where "age > 18"
+
+# Combine select, where, and limit
+paimon table read mydb.users -s id,name -w "age >= 20 AND city = 'Beijing'" -l 50
 ```
 
+**WHERE Operators**
+
+The `--where` option supports SQL-like filter expressions:
+
+| Operator | Example |
+|---|---|
+| `=`, `!=`, `<>` | `name = 'Alice'` |
+| `<`, `<=`, `>`, `>=` | `age > 18` |
+| `IS NULL`, `IS NOT NULL` | `deleted_at IS NULL` |
+| `IN (...)`, `NOT IN (...)` | `status IN ('active', 'pending')` |
+| `BETWEEN ... AND ...` | `age BETWEEN 20 AND 30` |
+| `LIKE` | `name LIKE 'A%'` |

Review Comment:
   The parser supports `NOT BETWEEN` (it is tested in 
`where_parser_test.py`), but the WHERE Operators table in the documentation 
omits it. `NOT IN` is already listed next to `IN (...)`, so for consistency the 
table should also list `NOT BETWEEN ... AND ...` alongside the existing 
`BETWEEN ... AND ...` entry.



##########
paimon-python/pypaimon/cli/cli_table.py:
##########
@@ -63,21 +63,46 @@ def cmd_table_read(args):
     # Build read pipeline
     read_builder = table.new_read_builder()
     
-    # Apply projection (select columns) if specified
+    available_fields = set(field.name for field in table.table_schema.fields)
+
+    # Parse select and where options
     select_columns = args.select
+    where_clause = args.where
+    user_columns = None
+    extra_where_columns = []
+
     if select_columns:
         # Parse column names (comma-separated)
-        columns = [col.strip() for col in select_columns.split(',')]
-        
+        user_columns = [col.strip() for col in select_columns.split(',')]
+
         # Validate that all columns exist in the table schema
-        available_fields = set(field.name for field in table.table_schema.fields)
-        invalid_columns = [col for col in columns if col not in available_fields]
-        
+        invalid_columns = [col for col in user_columns if col not in available_fields]
         if invalid_columns:
             print(f"Error: Column(s) {invalid_columns} do not exist in table '{table_identifier}'.", file=sys.stderr)
             sys.exit(1)
-        
-        read_builder = read_builder.with_projection(columns)
+
+    # When both select and where are specified, ensure where-referenced fields
+    # are included in the projection so the filter can work correctly.
+    if user_columns and where_clause:
+        from pypaimon.cli.where_parser import extract_fields_from_where
+        where_fields = extract_fields_from_where(where_clause, available_fields)
+        user_column_set = set(user_columns)
+        extra_where_columns = [f for f in where_fields if f not in user_column_set]

Review Comment:
   `extract_fields_from_where` returns `where_fields` as a `set`, so the 
comprehension that builds `extra_where_columns` at line 90 iterates in an 
unspecified order. When more than one WHERE-referenced field is missing from 
the user's selection, the order of `projection_columns`, and therefore of the 
projected `read_type`, can differ between runs. A stable, reproducible order 
matters for predictable output, and correct predicate-index computation depends 
on it once the related index bug is fixed. Consider preserving the order from 
the table schema instead, e.g., building it as `[f.name for f in 
table.table_schema.fields if f.name not in user_column_set and f.name in 
where_fields]`.
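
   The schema-ordered construction suggested above can be sketched in 
isolation. This is a minimal illustration, not the PR's code: `Field` is a 
hypothetical stand-in for the table schema field type, and 
`extra_columns_in_schema_order` is a helper name invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a table schema field (assumption)."""
    name: str


def extra_columns_in_schema_order(schema_fields, user_columns, where_fields):
    """Return WHERE-referenced columns missing from the selection.

    The result follows the order of `schema_fields`, so it does not depend
    on the iteration order of the `where_fields` set.
    """
    user_column_set = set(user_columns)
    return [
        f.name
        for f in schema_fields
        if f.name not in user_column_set and f.name in where_fields
    ]


schema = [Field("id"), Field("name"), Field("city"), Field("age")]
extra = extra_columns_in_schema_order(schema, ["name"], {"age", "city"})
print(extra)  # ['city', 'age']
```

   Because the comprehension iterates over the schema list and only uses the 
set for membership checks, the output is the same on every run.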



##########
paimon-python/pypaimon/cli/cli_table.py:
##########
@@ -63,21 +63,46 @@ def cmd_table_read(args):
     # Build read pipeline
     read_builder = table.new_read_builder()
     
-    # Apply projection (select columns) if specified
+    available_fields = set(field.name for field in table.table_schema.fields)
+
+    # Parse select and where options
     select_columns = args.select
+    where_clause = args.where
+    user_columns = None
+    extra_where_columns = []
+
     if select_columns:
         # Parse column names (comma-separated)
-        columns = [col.strip() for col in select_columns.split(',')]
-        
+        user_columns = [col.strip() for col in select_columns.split(',')]
+
         # Validate that all columns exist in the table schema
-        available_fields = set(field.name for field in table.table_schema.fields)
-        invalid_columns = [col for col in columns if col not in available_fields]
-        
+        invalid_columns = [col for col in user_columns if col not in available_fields]
         if invalid_columns:
             print(f"Error: Column(s) {invalid_columns} do not exist in table '{table_identifier}'.", file=sys.stderr)
             sys.exit(1)
-        
-        read_builder = read_builder.with_projection(columns)
+
+    # When both select and where are specified, ensure where-referenced fields
+    # are included in the projection so the filter can work correctly.
+    if user_columns and where_clause:
+        from pypaimon.cli.where_parser import extract_fields_from_where
+        where_fields = extract_fields_from_where(where_clause, available_fields)
+        user_column_set = set(user_columns)
+        extra_where_columns = [f for f in where_fields if f not in user_column_set]
+        projection_columns = user_columns + extra_where_columns
+        read_builder = read_builder.with_projection(projection_columns)
+    elif user_columns:
+        read_builder = read_builder.with_projection(user_columns)
+
+    # Apply where filter if specified
+    if where_clause:
+        from pypaimon.cli.where_parser import parse_where_clause
+        try:
+            predicate = parse_where_clause(where_clause, table.table_schema.fields)

Review Comment:
   When both `--select` and `--where` are specified, the predicate is built 
from `table.table_schema.fields` (the full schema) at line 100, so each 
predicate field gets its index from the full schema's field order. For 
primary-key tables, however, `FilterRecordReader.predicate.test(record)` reads 
record fields by `predicate.index`, and those records contain only the 
projected fields, in projected order. If a WHERE-referenced field sits at a 
different position in the projected schema than in the full schema, the filter 
reads the wrong field and either returns incorrect results or raises an 
`IndexError`. For example, with full schema `[id(0), name(1), city(2), age(3)]` 
and projection `['name', 'age']`, `age` is at index 1 in the projected schema 
but at index 3 in the full schema.
   
   The fix is to build the predicate against the effective projected fields 
instead of `table.table_schema.fields`: pass the schema fields that correspond 
to `projection_columns`, in that order, to `parse_where_clause`. Since 
`projection_columns` already includes every WHERE-referenced field via 
`extra_where_columns`, type information is preserved and the predicate indices 
correctly reflect the projected read-type order.
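
   The index mismatch described above can be reproduced with a small 
self-contained sketch. Everything here is illustrative: `Field`, `GreaterThan`, 
and `build_predicate` are hypothetical stand-ins (the last one for 
`parse_where_clause`), assumed only to derive the predicate index from the 
position of the column in the field list it is given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a table schema field (assumption)."""
    name: str


@dataclass(frozen=True)
class GreaterThan:
    """Toy predicate that reads a record by position."""
    index: int
    literal: int

    def test(self, record):
        return record[self.index] > self.literal


def build_predicate(column, literal, fields):
    """Hypothetical stand-in for parse_where_clause: index comes from `fields`."""
    names = [f.name for f in fields]
    return GreaterThan(names.index(column), literal)


full_schema = [Field("id"), Field("name"), Field("city"), Field("age")]
projection_columns = ["name", "age"]

# Map projected column names back to field objects, keeping projected order.
by_name = {f.name: f for f in full_schema}
projected_fields = [by_name[c] for c in projection_columns]

record = ("Alice", 30)  # a projected record: (name, age)

wrong = build_predicate("age", 18, full_schema)       # index taken from full schema
right = build_predicate("age", 18, projected_fields)  # index taken from projection

print(right.index, right.test(record))
try:
    wrong.test(record)
except IndexError:
    print("full-schema index", wrong.index, "is out of range for the projected record")
```

   In this toy case the mismatch surfaces as an `IndexError`; with a wider 
projection the same bug would silently compare against the wrong column.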



