JonasJ-ap commented on code in PR #7148:
URL: https://github.com/apache/iceberg/pull/7148#discussion_r1144149218
##########
python/pyiceberg/table/__init__.py:
##########
@@ -415,3 +416,8 @@ def to_duckdb(self, table_name: str, connection: Optional[DuckDBPyConnection] =
         con.register(table_name, self.to_arrow())
         return con
+
+    def to_ray(self) -> ray.data.dataset.Dataset:
+        import ray
+
+        return ray.data.from_arrow(self.to_arrow())
Review Comment:
Thank you for mentioning this. `to_arrow()` converts all the rows and
fields (columns) included in the current table scan into a `pa.Table`.
Row filters or field selectors can be applied to the table scan to exclude
unwanted rows or fields, e.g.
```python
not_nan = table_test_null_nan.scan(
    row_filter=NotNaN("col_numeric"),
    selected_fields=("idx",),
).to_arrow()
```
Table size does become an issue for `to_arrow` if the table scan ends up
including the whole table (or too many rows), since the entire result is
materialized in memory. The open PR #7163 may help in this case by adding a
limit on the number of rows included in the table scan. Here is an example
quoted from that PR:
```python
limited_result = table_test_limit.scan(selected_fields=("idx",), limit=20).to_arrow()
```
Going forward, I think there could be more discussion about converting large
Iceberg tables to Ray datasets. Please let me know if you have any further
questions or concerns about this.
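For illustration, the same scan pruning can be combined with the `to_ray` method added in this PR, so that only the filtered rows and selected fields are materialized through Arrow before the Ray dataset is built. This is a minimal sketch, assuming a catalog-loaded table (the catalog and table names here are hypothetical) and the `limit` parameter proposed in the not-yet-merged PR #7163:
```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import NotNaN

# Hypothetical catalog and table names, for illustration only.
catalog = load_catalog("default")
table = catalog.load_table("examples.table_test_null_nan")

# Prune the scan first, then convert: to_ray() calls to_arrow() under the
# hood, so the row filter, field selection, and limit all bound the amount
# of data materialized in memory before the Ray dataset is created.
ray_ds = table.scan(
    row_filter=NotNaN("col_numeric"),
    selected_fields=("idx",),
    limit=20,  # assumes the limit parameter proposed in PR #7163
).to_ray()

print(ray_ds.count())
```
Because `to_ray` simply wraps `ray.data.from_arrow(self.to_arrow())`, any pruning applied to the scan carries over to the resulting Ray dataset unchanged.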
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]