mr-brobot opened a new issue, #8011:
URL: https://github.com/apache/iceberg/issues/8011

   ### Feature Request / Improvement
   
   ## Problem
   
   PyIceberg currently relies on [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) in [`Table.plan_files`](https://github.com/apache/iceberg/blob/a73f10b3e7a98d7efcab7e01382b78bffdc7028e/python/pyiceberg/table/__init__.py#L776) and [the pyarrow interface](https://github.com/apache/iceberg/blob/a73f10b3e7a98d7efcab7e01382b78bffdc7028e/python/pyiceberg/io/pyarrow.py#L873C49-L873C49). Unfortunately, `multiprocessing` backs its locks and queues with POSIX semaphores in `/dev/shm`, which serverless runtimes like [AWS Lambda](https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/) and [AWS Fargate](https://github.com/aws/containers-roadmap/issues/710) do not provide. In effect, relying on `/dev/shm` assumes the user controls the host environment, which rules out serverless environments.
   
   
   [Apparently](https://github.com/lambci/docker-lambda/issues/75#issuecomment-668897239), one way to emulate Lambda or Fargate container runtimes locally is by running a container with [`--ipc="none"`](https://docs.docker.com/engine/reference/run/#ipc-settings---ipc). This will disable the `/dev/shm` mount and cause the `multiprocessing` module to fail with the following error:
   
   ```python
   [ERROR] OSError: [Errno 38] Function not implemented
   Traceback (most recent call last):
     File "/var/task/app.py", line 27, in handler
       result = scan.to_arrow().slice_length(limit)
     File "/var/task/pyiceberg/table/__init__.py", line 819, in to_arrow
       self.plan_files(),
     File "/var/task/pyiceberg/table/__init__.py", line 776, in plan_files
       with ThreadPool() as pool:
     File "/var/lang/lib/python3.10/multiprocessing/pool.py", line 930, in 
__init__
       Pool.__init__(self, processes, initializer, initargs)
     File "/var/lang/lib/python3.10/multiprocessing/pool.py", line 196, in 
__init__
       self._change_notifier = self._ctx.SimpleQueue()
     File "/var/lang/lib/python3.10/multiprocessing/context.py", line 113, in 
SimpleQueue
       return SimpleQueue(ctx=self.get_context())
     File "/var/lang/lib/python3.10/multiprocessing/queues.py", line 341, in 
__init__
       self._rlock = ctx.Lock()
     File "/var/lang/lib/python3.10/multiprocessing/context.py", line 68, in 
Lock
       return Lock(ctx=self.get_context())
     File "/var/lang/lib/python3.10/multiprocessing/synchronize.py", line 162, 
in __init__
       SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
     File "/var/lang/lib/python3.10/multiprocessing/synchronize.py", line 57, 
in __init__
       sl = self._semlock = _multiprocessing.SemLock(
   ```
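   
   For reference, the failure reproduces outside PyIceberg as well. The snippet below is a minimal sketch (not PyIceberg code) that bottoms out in the same `_multiprocessing.SemLock` call when run in a container started with `--ipc="none"`:
   
   ```python
   # Constructing a multiprocessing ThreadPool allocates SemLock-backed queues,
   # which require POSIX semaphores (normally mounted at /dev/shm).
   from multiprocessing.pool import ThreadPool

   def square(x: int) -> int:
       return x * x

   if __name__ == "__main__":
       # Fails during ThreadPool() construction with
       # "OSError: [Errno 38] Function not implemented"
       # when the runtime cannot create POSIX semaphores.
       with ThreadPool() as pool:
           print(pool.map(square, range(4)))
   ```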
   
   Interesting note from [`multiprocessing.pool.ThreadPool`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.ThreadPool):
   > A `ThreadPool` shares the same interface as `Pool`, which is designed around a pool of processes and predates the introduction of the `concurrent.futures` module [...] Users should generally prefer to use `concurrent.futures.ThreadPoolExecutor`, which has a simpler interface that was designed around threads from the start, and which returns `concurrent.futures.Future` instances that are compatible with many other libraries, including `asyncio`.
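   
   As a toy illustration of the `concurrent.futures` interface the quote recommends (the function and manifest names below are just stand-ins), `ThreadPoolExecutor` uses plain `threading` primitives and never touches `/dev/shm`:
   
   ```python
   from concurrent.futures import ThreadPoolExecutor

   def fetch(manifest: str) -> str:
       return manifest.upper()  # stand-in for I/O-bound manifest work

   # No multiprocessing locks are created here, so this works where /dev/shm
   # is unavailable; submit() returns Future objects that compose with asyncio
   # and other libraries.
   with ThreadPoolExecutor(max_workers=4) as executor:
       futures = [executor.submit(fetch, m) for m in ("m1.avro", "m2.avro")]
       results = [f.result() for f in futures]
   ```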
   
   ## Proposal
   
   Perhaps PyIceberg should support multiple concurrency strategies, allowing 
the user to configure which is most appropriate for their runtime/resources.
   
   Instead of using `multiprocessing` directly, we could depend on a concrete implementation of an [`Executor`](https://docs.python.org/3/library/concurrent.futures.html#executor-objects) from the [concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html) module wherever we need concurrency. The user can select the appropriate implementation via configuration:
   
   - `PYICEBERG__CONCURRENCY=process` uses [`ProcessPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor) (default)
   - `PYICEBERG__CONCURRENCY=thread` uses [`ThreadPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) (appropriate for serverless environments, since it avoids `multiprocessing` locks entirely)
   
   This might even allow PyIceberg to support other concurrency models in the 
future, e.g., user-defined implementations of `Executor`.
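   
   As a rough sketch of the configuration hook (the `ExecutorFactory` name below is just a placeholder, not existing PyIceberg API):
   
   ```python
   import os
   from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor


   class ExecutorFactory:
       """Return the Executor implementation selected via configuration."""

       @staticmethod
       def create() -> Executor:
           strategy = os.environ.get("PYICEBERG__CONCURRENCY", "process")
           if strategy == "thread":
               # Thread-based: no /dev/shm requirement, so it works on Lambda/Fargate.
               return ThreadPoolExecutor()
           if strategy == "process":
               return ProcessPoolExecutor()
           raise ValueError(f"Unsupported PYICEBERG__CONCURRENCY value: {strategy}")
   ```
   
   Call sites such as `plan_files` could then obtain an executor from the factory and use `Executor.map`/`Executor.submit` where they currently use `ThreadPool`, so the surrounding planning logic stays the same regardless of the chosen strategy.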
   
   I reproduced this problem and have a fix on a fork, confirming that this 
approach at least works. Depending on feedback from the community, I can tidy 
things up and submit a PR. 😃 
   
   ### Query engine
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

