Polectron commented on code in PR #7163:
URL: https://github.com/apache/iceberg/pull/7163#discussion_r1145424387


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -517,15 +536,22 @@ def _file_to_table(
         if file_schema is None:
             raise ValueError(f"Missing Iceberg schema in Metadata for file: 
{path}")
 
-        arrow_table = pq.read_table(
-            source=fout,
-            schema=parquet_schema,
-            pre_buffer=True,
-            buffer_size=8 * ONE_MEGABYTE,
-            filters=pyarrow_filter,
+        fragment_scanner = ds.Scanner.from_fragment(
+            fragment=fragment,
+            schema=physical_schema,
+            filter=pyarrow_filter,
             columns=[col.name for col in file_project_schema.columns],
         )
 
+        if limit:
+            arrow_table = fragment_scanner.head(limit)
+            with rows_counter.get_lock():

Review Comment:
   I think you actually made the right choice going for multi-threading, because the task is [IO-bound and there are potentially thousands of tasks to execute](https://superfastpython.com/threadpool-vs-pool-in-python/). Even though multi-threading is subject to the GIL, [some operations might still not be thread-safe](https://superfastpython.com/threadpoolexecutor-thread-safe/) (like `+=`, which performs a read and a write as two separate operations).
   Also, even though the tasks run on threads in a `ThreadPool`, [`multiprocessing.Value`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Value) is guarded by a [`multiprocessing.RLock`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.RLock) by default, which is compatible with both processes and threads.


