hendrikmakait opened a new issue, #38806:
URL: https://github.com/apache/arrow/issues/38806

   ### Describe the enhancement requested
   
   # Problem
   
   In Dask, we need to force deep copies of `pa.Table`s to ensure that views/slices sever references to the original buffers and allow us to free memory. From what I understand, there are a few ways to force a copy, but all of them come with downsides and/or have a clumsy API (see _Alternatives_).
   
   # Proposal
   
   To give better control over copying a `pa.Table`, I propose adding a `pa.Table.copy()` method that creates a deep copy of the table. Ideally, this `copy()` method would take a boolean `combine` keyword that combines chunks if `True` and maintains the existing chunking scheme otherwise (the default).
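
   For illustration, usage of the proposed method might look roughly like the sketch below. Neither `Table.copy()` nor its `combine` keyword exists in pyarrow today; this is only the API being requested here:

   ```python3
   import pyarrow as pa

   table = pa.table({"n_legs": [2, 4, 5, 100]})
   sliced = table.slice(0, 1)

   # Hypothetical API requested in this issue -- not part of pyarrow today.
   copied = sliced.copy()                # deep copy, keeps the existing chunking (default)
   combined = sliced.copy(combine=True)  # deep copy, additionally combines chunks
   ```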
   
   
   # Alternatives
   
   * `pa.Table.take()` and `pa.Table.filter()` could be used, but they have the additional overhead of evaluating some criterion before copying. This is also a fairly clumsy API for the purpose and prone to being optimized at some point so that zero-copies are performed "if possible".
   * We could manually copy the individual columns using `pa.concat_arrays` and compose a new `Table` from those (see the sketch after this list). However, `pa.concat_arrays` has to acquire the GIL when creating the returned Python object, which causes us to run into GIL contention due to the convoy effect (https://bugs.python.org/issue7946): something else hogs the GIL, and our loop over the columns gets slowed down because every time we try to acquire the GIL, we have to wait.
   * `pa.Table.combine_chunks()` copies a column only if it has _more than_ a single chunk. Once again, we would have to jump through some hoops to ensure that this is the case, or fall back to another solution that forces a copy.
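
   For reference, the per-column workaround from the second bullet could look roughly like the following sketch. The `truncated_copy` helper is purely illustrative (it is not a pyarrow API), it assumes every column has at least one chunk, and the GIL-contention caveat above still applies because the loop over columns runs in Python:

   ```python3
   import pyarrow as pa

   def truncated_copy(table: pa.Table) -> pa.Table:
       # Illustrative helper (not a pyarrow API): deep-copy a possibly sliced table
       # by concatenating each column's chunks into freshly allocated arrays.
       # Assumes every column has at least one chunk.
       arrays = [pa.concat_arrays(col.chunks) for col in table.columns]
       return pa.Table.from_arrays(arrays, schema=table.schema)

   table = pa.table({"animals": ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
   sliced = table.slice(0, 1)
   copied = truncated_copy(sliced)
   # The copied column's offsets buffer is truncated (8 bytes here vs. 20 for the original).
   print(copied.column("animals").chunks[0].buffers()[1].size)
   ```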
   
   # Side Comments
   
   Intuitively, I would have thought that `copy.deepcopy(table)` as well as `pickle.loads(pickle.dumps(table))` would serve my purpose. From what I can see, however, both copy the entire underlying buffers of views/slices rather than just the referenced portion. This may be by design to ensure that offsets are maintained, but it makes it even more important to have a way to truncate the underlying buffers of views/slices, to avoid having to pickle _all the data_. Am I doing something wrong here?
   
   ```python3
   import copy
   import pickle
   
   import pandas as pd
   import pyarrow as pa
   
   df = pd.DataFrame({'n_legs': [2, 4, 5, 100],
                      'animals': ["Flamingo", "Horse", "Brittle stars", "Centipede"]})
   table = pa.Table.from_pandas(df)
   # buffers()[1] is the int32 offsets buffer of the string column.
   print(f"Original size: {table.column('animals').chunks[0].buffers()[1].size}")
   # Slicing is zero-copy: the slice still references the original buffers.
   sliced = table.slice(0, 1)
   # Re-concatenating the chunks allocates fresh, truncated buffers.
   truncated = pa.concat_arrays([chunk for chunk in sliced.column("animals").chunks if chunk is not None])
   print(f"Truncated size: {truncated.buffers()[1].size}")
   deep_copied = copy.deepcopy(sliced)
   print(f"Deepcopied size: {deep_copied.column('animals').chunks[0].buffers()[1].size}")
   pickled = pickle.loads(pickle.dumps(sliced))
   print(f"Pickled size: {pickled.column('animals').chunks[0].buffers()[1].size}")
   ```
   results in 
   ```
   Original size: 20
   Truncated size: 8
   Deepcopied size: 20
   Pickled size: 20
   ```
   
   ### Component(s)
   
   Python

