Copilot commented on code in PR #7389:
URL: https://github.com/apache/paimon/pull/7389#discussion_r2909476656
##########
docs/content/pypaimon/cli.md:
##########
@@ -94,10 +95,53 @@ paimon table read mydb.users -l 50
# Read specific columns
paimon table read mydb.users -s id,name,age
-# Combine select and limit
-paimon table read mydb.users -s id,name -l 50
+# Filter with WHERE clause
+paimon table read mydb.users --where "age > 18"
+
+# Combine select, where, and limit
+paimon table read mydb.users -s id,name -w "age >= 20 AND city = 'Beijing'" -l
50
```
+**WHERE Operators**
+
+The `--where` option supports SQL-like filter expressions:
+
+| Operator | Example |
+|---|---|
+| `=`, `!=`, `<>` | `name = 'Alice'` |
+| `<`, `<=`, `>`, `>=` | `age > 18` |
+| `IS NULL`, `IS NOT NULL` | `deleted_at IS NULL` |
+| `IN (...)`, `NOT IN (...)` | `status IN ('active', 'pending')` |
+| `BETWEEN ... AND ...` | `age BETWEEN 20 AND 30` |
+| `LIKE` | `name LIKE 'A%'` |
Review Comment:
The `NOT BETWEEN` operator is supported by the parser (and tested in
`where_parser_test.py`) but is missing from the WHERE Operators table in the
documentation. Both `NOT IN` and `NOT BETWEEN` are useful operators that users
may want to know about. The table should include a row for `NOT BETWEEN ... AND
...` alongside the existing `BETWEEN ... AND ...` entry.
##########
paimon-python/pypaimon/cli/cli_table.py:
##########
@@ -63,21 +63,46 @@ def cmd_table_read(args):
# Build read pipeline
read_builder = table.new_read_builder()
- # Apply projection (select columns) if specified
+ available_fields = set(field.name for field in table.table_schema.fields)
+
+ # Parse select and where options
select_columns = args.select
+ where_clause = args.where
+ user_columns = None
+ extra_where_columns = []
+
if select_columns:
# Parse column names (comma-separated)
- columns = [col.strip() for col in select_columns.split(',')]
-
+ user_columns = [col.strip() for col in select_columns.split(',')]
+
# Validate that all columns exist in the table schema
- available_fields = set(field.name for field in
table.table_schema.fields)
- invalid_columns = [col for col in columns if col not in
available_fields]
-
+ invalid_columns = [col for col in user_columns if col not in
available_fields]
if invalid_columns:
print(f"Error: Column(s) {invalid_columns} do not exist in table
'{table_identifier}'.", file=sys.stderr)
sys.exit(1)
-
- read_builder = read_builder.with_projection(columns)
+
+ # When both select and where are specified, ensure where-referenced fields
+ # are included in the projection so the filter can work correctly.
+ if user_columns and where_clause:
+ from pypaimon.cli.where_parser import extract_fields_from_where
+ where_fields = extract_fields_from_where(where_clause,
available_fields)
+ user_column_set = set(user_columns)
+ extra_where_columns = [f for f in where_fields if f not in
user_column_set]
Review Comment:
`where_fields` is returned as a `set` from `extract_fields_from_where`, so
iterating over it to build `extra_where_columns` at line 90 gives a
non-deterministic ordering. This makes the `projection_columns` list (and thus
the projected `read_type`) non-deterministic when more than one
WHERE-referenced field is missing from the user's selection. Consistently
deterministic behavior (and especially correct predicate-index computation
after fixing the related index bug) requires that `extra_where_columns` have a
stable, reproducible order. Consider preserving the order from the table schema
instead, e.g., building it as `[f.name for f in table.table_schema.fields if
f.name not in user_column_set and f.name in where_fields]`.
##########
paimon-python/pypaimon/cli/cli_table.py:
##########
@@ -63,21 +63,46 @@ def cmd_table_read(args):
# Build read pipeline
read_builder = table.new_read_builder()
- # Apply projection (select columns) if specified
+ available_fields = set(field.name for field in table.table_schema.fields)
+
+ # Parse select and where options
select_columns = args.select
+ where_clause = args.where
+ user_columns = None
+ extra_where_columns = []
+
if select_columns:
# Parse column names (comma-separated)
- columns = [col.strip() for col in select_columns.split(',')]
-
+ user_columns = [col.strip() for col in select_columns.split(',')]
+
# Validate that all columns exist in the table schema
- available_fields = set(field.name for field in
table.table_schema.fields)
- invalid_columns = [col for col in columns if col not in
available_fields]
-
+ invalid_columns = [col for col in user_columns if col not in
available_fields]
if invalid_columns:
print(f"Error: Column(s) {invalid_columns} do not exist in table
'{table_identifier}'.", file=sys.stderr)
sys.exit(1)
-
- read_builder = read_builder.with_projection(columns)
+
+ # When both select and where are specified, ensure where-referenced fields
+ # are included in the projection so the filter can work correctly.
+ if user_columns and where_clause:
+ from pypaimon.cli.where_parser import extract_fields_from_where
+ where_fields = extract_fields_from_where(where_clause,
available_fields)
+ user_column_set = set(user_columns)
+ extra_where_columns = [f for f in where_fields if f not in
user_column_set]
+ projection_columns = user_columns + extra_where_columns
+ read_builder = read_builder.with_projection(projection_columns)
+ elif user_columns:
+ read_builder = read_builder.with_projection(user_columns)
+
+ # Apply where filter if specified
+ if where_clause:
+ from pypaimon.cli.where_parser import parse_where_clause
+ try:
+ predicate = parse_where_clause(where_clause,
table.table_schema.fields)
Review Comment:
When both `--select` and `--where` are specified, the predicate is built
from `table.table_schema.fields` (the full schema) at line 100, giving each
predicate field an index based on the full schema's field order. However, for
primary-key tables, `FilterRecordReader.predicate.test(record)` accesses record
fields by `predicate.index`, and those records only contain the projected
fields (in projected order). If a WHERE-referenced field is not at the same
position in the projected schema as in the full schema (e.g., full schema:
`[id(0), name(1), city(2), age(3)]`, projection: `['name', 'age']`, then `age`
is at index 1 in the projected schema but at index 3 in the full schema), the
filter will access the wrong field, returning incorrect results or raising an
`IndexError`.
The fix is to pass the effective projected fields—`projection_columns`
(which already includes all WHERE-referenced fields via
`extra_where_columns`)—to `parse_where_clause` instead of
`table.table_schema.fields`. Since `projection_columns` always contains all
fields referenced by the WHERE clause, type information is preserved and
predicate indices will correctly reflect the projected read-type order.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]