Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/21977

@holdenk pyarrow uses a C++-based memory pool, so I'm not sure exactly how that interacts with rlimit, but I ran some tests and it looks like an error is thrown when the limit is set.

**with setrlimit**
```python
>>> import pyarrow as pa
>>> import resource
>>> resource.setrlimit(resource.RLIMIT_AS, (1000 * 1024 * 1024, 1000 * 1024 * 1024))
>>> a = list(range(1 << 20))
>>> b = [pa.array(a) for i in range(10)]
>>> c = [pa.array(a) for i in range(10)]
>>> pa.total_allocated_bytes()
170393600
>>> d = [pa.array(a) for i in range(100)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "pyarrow/array.pxi", line 186, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 26, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowMemoryError: malloc of size 8388608 failed
```

**no limit**
```python
>>> import pyarrow as pa
>>> a = list(range(1 << 20))
>>> b = [pa.array(a) for i in range(10)]
>>> c = [pa.array(a) for i in range(10)]
>>> pa.total_allocated_bytes()
170393600
>>> d = [pa.array(a) for i in range(100)]
>>> pa.total_allocated_bytes()
1022361600
```

One thing I wasn't expecting: importing pyarrow and its shared libraries after setting rlimit can fail if the limit is set too low, and the failure is not clean - is this expected?

```python
>>> import resource
>>> resource.setrlimit(resource.RLIMIT_AS, (100 * 1024 * 1024, 100 * 1024 * 1024))
>>> import pyarrow
Traceback (most recent call last):
  File "/home/bryan/miniconda2/envs/pa010py35/lib/python3.5/site-packages/numpy/core/__init__.py", line 16, in <module>
    from . import multiarray
ImportError: libopenblas.so.0: failed to map segment from shared object
```
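Since the unclean failure comes from mapping shared objects after the limit is already in place, one possible workaround is to import pyarrow before calling setrlimit, so later allocations fail with a catchable `ArrowMemoryError` instead. A minimal sketch (my own illustration, not code from this PR; the 1 GB cap is an arbitrary value matching the test above):

```python
# Sketch of a workaround (assumption, not part of this PR): import pyarrow
# while the address space is still unlimited, so its shared libraries map
# cleanly, and only then tighten RLIMIT_AS.
import resource

import pyarrow as pa  # import first, before any limit is in place

cap = 1000 * 1024 * 1024  # arbitrary 1 GB cap for illustration
resource.setrlimit(resource.RLIMIT_AS, (cap, cap))

a = list(range(1 << 20))
try:
    d = [pa.array(a) for i in range(100)]
except pa.lib.ArrowMemoryError as e:
    # pyarrow's C++ memory pool surfaces the failed malloc as an
    # ArrowMemoryError rather than crashing the interpreter
    print("allocation failed under RLIMIT_AS:", e)
```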