JonasJ-ap commented on code in PR #7148:
URL: https://github.com/apache/iceberg/pull/7148#discussion_r1144149218


##########
python/pyiceberg/table/__init__.py:
##########
@@ -415,3 +416,8 @@ def to_duckdb(self, table_name: str, connection: Optional[DuckDBPyConnection] =
         con.register(table_name, self.to_arrow())
 
         return con
+
+    def to_ray(self) -> ray.data.dataset.Dataset:
+        import ray
+
+        return ray.data.from_arrow(self.to_arrow())

Review Comment:
   Thank you for mentioning this. `to_arrow()` converts all the rows and fields (columns) included in the current table scan to a `pa.Table`. Row filters or field selectors can be applied to the table scan to exclude unwanted rows or fields, e.g.:
   ```python
   not_nan = table_test_null_nan.scan(row_filter=NotNaN("col_numeric"), selected_fields=("idx",)).to_arrow()
   ```
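   
   Since the `to_ray()` added in this diff simply wraps `to_arrow()`, the same narrowing carries over to the Ray path as well. A minimal sketch, reusing the hypothetical table and column names from the example above:
   ```python
   from pyiceberg.expressions import NotNaN
   
   # Sketch: narrow the scan first, then hand the (smaller) Arrow table to Ray.
   # `table_test_null_nan`, "col_numeric", and "idx" are illustrative names.
   ray_ds = table_test_null_nan.scan(
       row_filter=NotNaN("col_numeric"),
       selected_fields=("idx",),
   ).to_ray()
   ```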
   
   Table size will become an issue for `to_arrow` if the table scan ends up including the whole table (or too many rows). There is an open PR #7163 that may help in this case by adding a limit on the number of rows included in the table scan. Here is one example quoted from that PR:
   ```python
   limited_result = table_test_limit.scan(selected_fields=("idx",), limit=20).to_arrow()
   ```
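   
   Once that lands, the same cap would bound the Arrow materialization behind `to_ray()` too. A sketch, assuming the `limit` keyword from #7163 and the illustrative table name from its example:
   ```python
   # Sketch assuming PR #7163's `limit` kwarg: cap the rows materialized into
   # the Arrow table that backs the Ray dataset.
   ray_ds = table_test_limit.scan(selected_fields=("idx",), limit=20).to_ray()
   ```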
   
   In the future, I think there could be more discussion on converting large Iceberg tables to Ray datasets. Please let me know if you have more questions or concerns about this.


