Fokko commented on code in PR #7163:
URL: https://github.com/apache/iceberg/pull/7163#discussion_r1143985117
##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -484,21 +488,36 @@ def expression_to_pyarrow(expr: BooleanExpression) -> pc.Expression:
return boolean_expression_visit(expr, _ConvertToArrowExpression())
+@lru_cache
+def _get_file_format(file_format: FileFormat, **kwargs: Dict[str, Any]) -> ds.FileFormat:
+ if file_format == FileFormat.PARQUET.value:
+ return ds.ParquetFileFormat(**kwargs)
+ elif file_format == FileFormat.ORC.value:
Review Comment:
Let's remove the ORC branch from this PR. ORC support needs more work, so we can
implement it separately in https://github.com/apache/iceberg/pull/7033.
##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -517,15 +536,22 @@ def _file_to_table(
if file_schema is None:
raise ValueError(f"Missing Iceberg schema in Metadata for file: {path}")
- arrow_table = pq.read_table(
- source=fout,
- schema=parquet_schema,
- pre_buffer=True,
- buffer_size=8 * ONE_MEGABYTE,
- filters=pyarrow_filter,
+ fragment_scanner = ds.Scanner.from_fragment(
+ fragment=fragment,
+ schema=physical_schema,
+ filter=pyarrow_filter,
columns=[col.name for col in file_project_schema.columns],
)
+ if limit:
+ arrow_table = fragment_scanner.head(limit)
+ with rows_counter.get_lock():
Review Comment:
I think we can remove this lock because we already did the expensive work.
This will make the code a bit simpler and avoid locking.
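
A minimal sketch of one way this could look without the shared counter, assuming each task already returns at most `limit` rows (for example via `head(limit)`) and the caller trims the combined result once. Plain Python lists stand in for Arrow tables here, and `combine_with_limit` is a hypothetical helper, not code from this PR:

```python
from typing import Iterable, List, Optional


def combine_with_limit(task_results: Iterable[List[int]], limit: Optional[int] = None) -> List[int]:
    """Concatenate per-file results and enforce the row limit in the caller.

    Assumption: every task has already done its (expensive) read, so trimming
    here costs little and needs no lock or shared counter.
    """
    rows: List[int] = []
    for batch in task_results:
        if limit is not None:
            remaining = limit - len(rows)
            if remaining <= 0:
                # Limit reached; ignore the remaining task results.
                break
            batch = batch[:remaining]
        rows.extend(batch)
    return rows
```

The trade-off is that some tasks may read rows that are later discarded, but that work has already happened by the time the counter would have been updated.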
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]