mr-brobot opened a new issue, #8011: URL: https://github.com/apache/iceberg/issues/8011
### Feature Request / Improvement

## Problem

PyIceberg currently relies on [`multiprocessing`](https://docs.python.org/3/library/multiprocessing.html) in [`Table.plan_files`](https://github.com/apache/iceberg/blob/a73f10b3e7a98d7efcab7e01382b78bffdc7028e/python/pyiceberg/table/__init__.py#L776) and [the pyarrow interface](https://github.com/apache/iceberg/blob/a73f10b3e7a98d7efcab7e01382b78bffdc7028e/python/pyiceberg/io/pyarrow.py#L873C49-L873C49). Unfortunately, `multiprocessing` relies on `/dev/shm`, which is not provided by serverless runtimes like [AWS Lambda](https://aws.amazon.com/blogs/compute/parallel-processing-in-python-with-aws-lambda/) and [AWS Fargate](https://github.com/aws/containers-roadmap/issues/710). In effect, reliance on `/dev/shm` assumes the user has control over the host environment, which disqualifies use in serverless environments.

[Apparently](https://github.com/lambci/docker-lambda/issues/75#issuecomment-668897239), one way to emulate the Lambda or Fargate container runtimes locally is to run a container with [`--ipc="none"`](https://docs.docker.com/engine/reference/run/#ipc-settings---ipc).
This disables the `/dev/shm` mount and causes the `multiprocessing` module to fail with the following error:

```python
[ERROR] OSError: [Errno 38] Function not implemented
Traceback (most recent call last):
  File "/var/task/app.py", line 27, in handler
    result = scan.to_arrow().slice_length(limit)
  File "/var/task/pyiceberg/table/__init__.py", line 819, in to_arrow
    self.plan_files(),
  File "/var/task/pyiceberg/table/__init__.py", line 776, in plan_files
    with ThreadPool() as pool:
  File "/var/lang/lib/python3.10/multiprocessing/pool.py", line 930, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/var/lang/lib/python3.10/multiprocessing/pool.py", line 196, in __init__
    self._change_notifier = self._ctx.SimpleQueue()
  File "/var/lang/lib/python3.10/multiprocessing/context.py", line 113, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/var/lang/lib/python3.10/multiprocessing/queues.py", line 341, in __init__
    self._rlock = ctx.Lock()
  File "/var/lang/lib/python3.10/multiprocessing/context.py", line 68, in Lock
    return Lock(ctx=self.get_context())
  File "/var/lang/lib/python3.10/multiprocessing/synchronize.py", line 162, in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
  File "/var/lang/lib/python3.10/multiprocessing/synchronize.py", line 57, in __init__
    sl = self._semlock = _multiprocessing.SemLock(
```

An interesting note from the [`multiprocessing.pool.ThreadPool`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.ThreadPool) docs:

> A `ThreadPool` shares the same interface as `Pool`, which is designed around a pool of processes and predates the introduction of the `concurrent.futures` module [...] Users should generally prefer to use `concurrent.futures.ThreadPoolExecutor`, which has a simpler interface that was designed around threads from the start, and which returns `concurrent.futures.Future` instances that are compatible with many other libraries, including `asyncio`.
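To illustrate the docs' recommendation, here is a minimal sketch of swapping `multiprocessing.pool.ThreadPool` for `concurrent.futures.ThreadPoolExecutor`. This is not PyIceberg's actual code; `open_manifest` is a hypothetical stand-in for the per-manifest work done in `plan_files`:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-manifest work in plan_files.
def open_manifest(manifest: int) -> int:
    return manifest * 2

manifests = [1, 2, 3]

# multiprocessing.pool.ThreadPool still allocates multiprocessing
# synchronization primitives (backed by /dev/shm), so it fails where
# /dev/shm is unavailable. ThreadPoolExecutor uses only in-process
# locks and works in serverless runtimes.
with ThreadPoolExecutor() as executor:
    results = list(executor.map(open_manifest, manifests))

print(results)  # [2, 4, 6]
```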
## Proposal

Perhaps PyIceberg should support multiple concurrency strategies, allowing the user to configure whichever is most appropriate for their runtime and resources. Instead of using `multiprocessing` directly, we could depend on a concrete implementation of an [`Executor`](https://docs.python.org/3/library/concurrent.futures.html#executor-objects) from the [`concurrent.futures`](https://docs.python.org/3/library/concurrent.futures.html) module wherever we need concurrency. The user can select the appropriate implementation via configuration:

- `PYICEBERG__CONCURRENCY=process` uses [`ProcessPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#processpoolexecutor) (default, same as the current implementation)
- `PYICEBERG__CONCURRENCY=thread` uses [`ThreadPoolExecutor`](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor) (appropriate for serverless environments)

This might even allow PyIceberg to support other concurrency models in the future, e.g., user-defined implementations of `Executor`.

I reproduced this problem and have a fix on a fork, confirming that this approach at least works. Depending on feedback from the community, I can tidy things up and submit a PR. 😃

### Query engine

None
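A minimal sketch of the proposed configuration switch. The environment variable name comes from the proposal above, but `create_executor` and its defaults are hypothetical, not an agreed-upon API:

```python
import os
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor

# Hypothetical factory illustrating the proposed switch; the function
# name and error handling are illustrative only.
def create_executor() -> Executor:
    strategy = os.environ.get("PYICEBERG__CONCURRENCY", "process")
    if strategy == "thread":
        # Thread-based: no /dev/shm needed, safe in serverless runtimes.
        return ThreadPoolExecutor()
    if strategy == "process":
        return ProcessPoolExecutor()
    raise ValueError(f"Unknown concurrency strategy: {strategy!r}")

# Call sites would then use the common Executor interface uniformly:
os.environ["PYICEBERG__CONCURRENCY"] = "thread"
with create_executor() as executor:
    print(list(executor.map(abs, [-1, -2, 3])))  # [1, 2, 3]
```

Because both pool types implement the same `Executor` interface, a user-defined `Executor` subclass could later be plugged into the same factory without touching call sites.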
