paleolimbot commented on code in PR #846:
URL: https://github.com/apache/sedona-db/pull/846#discussion_r3249600175
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -88,6 +88,57 @@ def head(self, n: int = 5) -> "DataFrame":
"""
return self.limit(n)
+ def __getitem__(self, key):
+ """Index into the DataFrame using pandas-style bracket access.
+
+ Three forms are supported:
+
+ - `df["x"]` returns an `Expr` referencing column `x`. Equivalent to
+ `sedonadb.expr.col("x")`. Note that this returns an `Expr`, not a
+ materialized column — the `Series` type that pandas users
+ eventually expect will land in a future phase.
+ - `df[["x", "y"]]` returns a new `DataFrame` with the listed columns
+ (equivalent to `df.select("x", "y")`).
+ - `df[bool_expr]` returns a new `DataFrame` filtered by the boolean
+ expression (equivalent to `df.filter(bool_expr)`).
+
+ Row-position indexing (integers, slices, `.loc`, `.iloc`) is
+ intentionally not supported — SedonaDB has no row ordering or
+ index concept in this scope.
+
+ Examples:
+
+ >>> from sedonadb.expr import col
+ >>> sd = sedona.db.connect()
+ >>> df = sd.sql("SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30)) AS t(x, y)")
+ >>> df["x"]
+ Expr(x)
+ >>> df[["x", "y"]].count()
+ 3
+ >>> df[df["x"] > 1].count()
+ 2
+ """
+ from sedonadb.expr import Expr
+ from sedonadb.expr import col as _col
Review Comment:
These should be module-level imports (and they should be combined into a
single statement). Lazy imports in this module are only for optional
dependencies (pyarrow is the usual culprit).
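As a rough illustration of that split (not code from this PR): required imports sit at module level on one combined line, while an optional dependency is imported inside the function that needs it. The helper `_import_optional` and the function `to_arrow_table` are hypothetical names, and stdlib imports stand in for `sedonadb.expr` so the sketch runs without SedonaDB installed.

```python
# Module-level, combined import for required dependencies. In dataframe.py the
# equivalent line would presumably be:
#     from sedonadb.expr import Expr, col as _col
from importlib import import_module
from types import ModuleType


def _import_optional(name: str, feature: str) -> ModuleType:
    """Import an optional dependency lazily, with a readable error if missing.

    Hypothetical helper for illustration; not part of sedonadb.
    """
    try:
        return import_module(name)
    except ImportError as e:
        raise ImportError(
            f"{feature} requires the optional package '{name}'"
        ) from e


def to_arrow_table(columns: dict):
    # pyarrow is imported only when this function is actually called, so
    # importing the module itself never fails when pyarrow is absent.
    pa = _import_optional("pyarrow", "to_arrow_table()")
    return pa.table(columns)
```

With this layout, importing the module never fails because of a missing optional package; only the one function that needs it does.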
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -88,6 +88,57 @@ def head(self, n: int = 5) -> "DataFrame":
"""
return self.limit(n)
+ def __getitem__(self, key):
Review Comment:
```suggestion
def __getitem__(self, key: Union[str, int]) -> Expr:
```
Eventually we can give `Expr` a subclass for columns to improve the type
hinting.
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -88,6 +88,57 @@ def head(self, n: int = 5) -> "DataFrame":
"""
return self.limit(n)
+ def __getitem__(self, key):
+ """Index into the DataFrame using pandas-style bracket access.
+
+ Three forms are supported:
+
+ - `df["x"]` returns an `Expr` referencing column `x`. Equivalent to
+ `sedonadb.expr.col("x")`. Note that this returns an `Expr`, not a
+ materialized column — the `Series` type that pandas users
+ eventually expect will land in a future phase.
+ - `df[["x", "y"]]` returns a new `DataFrame` with the listed columns
+ (equivalent to `df.select("x", "y")`).
+ - `df[bool_expr]` returns a new `DataFrame` filtered by the boolean
+ expression (equivalent to `df.filter(bool_expr)`).
+
+ Row-position indexing (integers, slices, `.loc`, `.iloc`) is
+ intentionally not supported — SedonaDB has no row ordering or
+ index concept in this scope.
+
+ Examples:
+
+ >>> from sedonadb.expr import col
+ >>> sd = sedona.db.connect()
+ >>> df = sd.sql("SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30)) AS t(x, y)")
+ >>> df["x"]
+ Expr(x)
+ >>> df[["x", "y"]].count()
+ 3
+ >>> df[df["x"] > 1].count()
+ 2
+ """
+ from sedonadb.expr import Expr
+ from sedonadb.expr import col as _col
+
+ if isinstance(key, str):
+ return _col(key)
Review Comment:
This should reach into the underlying `DFSchema` and pull out the actual
qualified column expression (displaying a nice error for columns that don't
exist). You'll need this for join expressions.
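A minimal sketch of that lookup, assuming a schema that stores each field with its relation qualifier. `QualifiedColumn` and `ToySchema` are toy stand-ins invented for illustration; they are not the `DFSchema` API or anything in sedonadb.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import List, Optional


@dataclass(frozen=True, repr=False)
class QualifiedColumn:
    """Toy stand-in for a qualified column expression (relation + name)."""

    relation: Optional[str]
    name: str

    def __repr__(self) -> str:
        return f"{self.relation}.{self.name}" if self.relation else self.name


class ToySchema:
    """Toy stand-in for a schema: an ordered list of qualified fields."""

    def __init__(self, fields: List[QualifiedColumn]):
        self._fields = list(fields)

    def column(self, name: str) -> QualifiedColumn:
        matches = [f for f in self._fields if f.name == name]
        if len(matches) == 1:
            return matches[0]
        if matches:
            raise LookupError(f"Column '{name}' is ambiguous: {matches}")
        # Unknown column: list what exists and suggest the closest name
        known = [f.name for f in self._fields]
        hint = get_close_matches(name, known, n=1)
        suggestion = f" Did you mean '{hint[0]}'?" if hint else ""
        raise LookupError(
            f"Column '{name}' not found.{suggestion} Available: {known}"
        )


left = ToySchema([QualifiedColumn("a", "id"), QualifiedColumn("a", "geometry")])
right = ToySchema([QualifiedColumn("b", "id"), QualifiedColumn("b", "name")])

# The two "id" columns stay distinguishable, which a join condition needs
print(left.column("id"), right.column("id"))  # a.id b.id
```

An unqualified `col("id")` could not tell these two columns apart, which is presumably why the qualified form matters for joins.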
##########
python/sedonadb/python/sedonadb/dataframe.py:
##########
@@ -88,6 +88,57 @@ def head(self, n: int = 5) -> "DataFrame":
"""
return self.limit(n)
+ def __getitem__(self, key):
+ """Index into the DataFrame using pandas-style bracket access.
+
+ Three forms are supported:
+
+ - `df["x"]` returns an `Expr` referencing column `x`. Equivalent to
+ `sedonadb.expr.col("x")`. Note that this returns an `Expr`, not a
+ materialized column — the `Series` type that pandas users
+ eventually expect will land in a future phase.
+ - `df[["x", "y"]]` returns a new `DataFrame` with the listed columns
+ (equivalent to `df.select("x", "y")`).
+ - `df[bool_expr]` returns a new `DataFrame` filtered by the boolean
+ expression (equivalent to `df.filter(bool_expr)`).
+
+ Row-position indexing (integers, slices, `.loc`, `.iloc`) is
+ intentionally not supported — SedonaDB has no row ordering or
+ index concept in this scope.
+
+ Examples:
+
+ >>> from sedonadb.expr import col
+ >>> sd = sedona.db.connect()
+ >>> df = sd.sql("SELECT * FROM (VALUES (1, 10), (2, 20), (3, 30)) AS t(x, y)")
+ >>> df["x"]
+ Expr(x)
+ >>> df[["x", "y"]].count()
+ 3
+ >>> df[df["x"] > 1].count()
+ 2
+ """
+ from sedonadb.expr import Expr
+ from sedonadb.expr import col as _col
+
+ if isinstance(key, str):
+ return _col(key)
+ if isinstance(key, Expr):
+ return self.filter(key)
+ if isinstance(key, list):
+ for k in key:
+ if not isinstance(k, str):
+ raise TypeError(
+ f"DataFrame[list] expects a list of column names, "
+ f"got {type(k).__name__}"
+ )
+ return self.select(*key)
+ raise TypeError(
+ f"DataFrame indexing is not supported for {type(key).__name__}. "
+ f"Use df['x'] for a column expression, df[['x', 'y']] to project "
+ f"columns, or df[bool_expr] to filter rows."
+ )
Review Comment:
I think we should restrict this to only the first case (`df["x"]` returning an
`Expr`). Accepting a filter or multi-column selection here is pandas-y, but it
makes the return type depend on the argument, which prevents type hinting from
working well on the output (i.e., IDEs and LLMs that rely on the language
server or on type hints can't autocomplete `tab["col"].<tab>`).
This should also support integer indexing (e.g., `tab[0]` gives you the
first column).
FWIW Ibis used to accept the second two cases but deprecated them
(`FutureWarning: Selecting/filtering arbitrary expressions in
`Table.__getitem__` is deprecated and will be removed in version 10.0. Please
use `Table.select` or `Table.filter` instead.`).
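For illustration only, here is a sketch of the narrowed `__getitem__`, assuming it accepts a column name or an integer position and always returns an `Expr`. The `Expr` and `Table` classes below are toy stand-ins, not the sedonadb implementations, and the error wording is made up.

```python
from typing import List, Union


class Expr:
    """Toy stand-in for sedonadb's Expr; it only stores a column name."""

    def __init__(self, name: str):
        self.name = name

    def __repr__(self) -> str:
        return f"Expr({self.name})"


class Table:
    """Toy stand-in for DataFrame with just enough state to index columns."""

    def __init__(self, column_names: List[str]):
        self._names = list(column_names)

    def __getitem__(self, key: Union[str, int]) -> Expr:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(key, int) and not isinstance(key, bool):
            try:
                return Expr(self._names[key])
            except IndexError:
                raise IndexError(
                    f"Column index {key} is out of range for "
                    f"{len(self._names)} columns"
                ) from None
        if isinstance(key, str):
            if key not in self._names:
                raise KeyError(key)
            return Expr(key)
        # Lists and expressions are rejected so the return type stays Expr
        raise TypeError(
            f"Table indexing is not supported for {type(key).__name__}; "
            "use select() to project columns or filter() to filter rows"
        )


tab = Table(["x", "y"])
print(tab["x"], tab[0], tab[1])  # Expr(x) Expr(x) Expr(y)
```

Because every successful call returns `Expr`, a type checker or language server can resolve attribute completions on `tab["x"]` without inspecting the argument.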