[
https://issues.apache.org/jira/browse/ARROW-11463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277515#comment-17277515
]
Leonard Lausen commented on ARROW-11463:
----------------------------------------
Thank you for sharing the tests / example code [~apitrou]. Pickle v5 is really
useful. For example, the following code can replicate my use-case for the
Plasma store based on providing a folder in {{/dev/shm}} as {{path}}.
{code:python}
import pickle
import mmap
def shm_pickle(path, tbl):
idx = 0
def buffer_callback(buf):
nonlocal idx
with open(path / f'{idx}.bin', 'wb') as f:
f.write(buf)
idx += 1
with open(path / 'meta.pkl', 'wb') as f:
pickle.dump(tbl, f, protocol=5, buffer_callback=buffer_callback)
def shm_unpickle(path):
num_buffers = len(list(path.iterdir())) - 1 # exclude meta.idx
buffers = []
for idx in range(num_buffers):
f = open(path / f'{idx}.bin', 'rb')
buffers.append(mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ))
with open(path / 'meta.pkl', 'rb') as f:
return pickle.load(f, buffers=buffers)
{code}
> Allow configuration of IpcWriterOptions 64Bit from PyArrow
> ----------------------------------------------------------
>
> Key: ARROW-11463
> URL: https://issues.apache.org/jira/browse/ARROW-11463
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Reporter: Leonard Lausen
> Assignee: Tao He
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> For tables with many chunks (2M+ rows, 20k+ chunks), `pyarrow.Table.take`
> will be around 1000x slower compared to the `pyarrow.Table.take` on the table
> with combined chunks (1 chunk). Unfortunately, if such table contains large
> list data type, it's easy for the flattened table to contain more than 2**31
> rows and serialization of the table with combined chunks (eg for Plasma
> store) will fail due to `pyarrow.lib.ArrowCapacityError: Cannot write arrays
> larger than 2^31 - 1 in length`
> I couldn't find a way to enable 64bit support for the serialization as called
> from Python (IpcWriteOptions in Python does not expose the CIpcWriteOptions
> 64 bit setting; further the Python serialization APIs do not allow
> specification of IpcWriteOptions)
> I was able to serialize successfully after changing the default and rebuilding
> {code:c++}
> modified cpp/src/arrow/ipc/options.h
> @@ -42,7 +42,7 @@ struct ARROW_EXPORT IpcWriteOptions {
> /// \brief If true, allow field lengths that don't fit in a signed 32-bit
> int.
> ///
> /// Some implementations may not be able to parse streams created with
> this option.
> - bool allow_64bit = false;
> + bool allow_64bit = true;
>
> /// \brief The maximum permitted schema nesting depth.
> int max_recursion_depth = kMaxNestingDepth;
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)