[GitHub] [arrow] wjones127 commented on pull request #15210: ARROW-18400: [C++] ListArray values() doesn't take into account offset

GitBox Thu, 05 Jan 2023 14:25:56 -0800


wjones127 commented on PR #15210:
URL: https://github.com/apache/arrow/pull/15210#issuecomment-1372870801


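   I re-ran the original reproduction, and memory usage no longer appears to 
grow quadratically with the number of rows. Doubling the row count roughly 
quadrupled memory usage on 10.0.1, but only about doubles it after this change: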
   
   | Num rows | Memory usage (10.0.1) | Memory usage (after) |
   | ---: | ---: | ---: |
   | 256k | 2,153,767,662 | 1,102,736,461 |
   | 512k | 8,496,047,798 | 2,185,596,364 |
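
As a rough check of the "no longer quadratic" reading, the growth exponent can be estimated from the two rows of the table. This is only a sketch: the helper `scaling_exponent` is a hypothetical name introduced here, "256k" and "512k" are assumed to mean 256,000 and 512,000 rows, and two data points can suggest a trend but not establish one.

```python
import math

# Values copied from the table above (row counts assumed to be 256,000 and 512,000).
rows = (256_000, 512_000)
before = (2_153_767_662, 8_496_047_798)   # 10.0.1
after = (1_102_736_461, 2_185_596_364)    # with this PR


def scaling_exponent(sizes, usage):
    """Estimate k in usage ~ sizes**k from two measurements (hypothetical helper)."""
    return math.log(usage[1] / usage[0]) / math.log(sizes[1] / sizes[0])


exp_before = scaling_exponent(rows, before)
exp_after = scaling_exponent(rows, after)

# An exponent near 2 suggests quadratic growth; near 1 suggests linear growth.
print(f"10.0.1: exponent ~ {exp_before:.2f}")  # about 1.98
print(f"after:  exponent ~ {exp_after:.2f}")   # about 0.99
```

A third measurement at a larger row count would make the estimate more convincing.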
   
   
   <details>
   <summary>Code for test</summary>
   
   Write test file:
   ```python
   import numpy as np
   import random
   import string
   import tracemalloc
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   _characters = string.ascii_uppercase + string.digits + string.punctuation
   
   def make_random_string(N=10):
       return ''.join(random.choice(_characters) for _ in range(N))
   
   nrows = 256_000
   filename = 'nested_pandas.parquet'
   
   arr_len = 10
   nested_col = []
   # The outer loop index is unused: the `i` in the conditions below is the
   # comprehension's own `i` (0..arr_len-1), not the row number.
   for _ in range(nrows):
       nested_col.append(np.array(
               [{
                   'a': None if i % 1000 == 0 else np.random.choice(10000, size=3).astype(np.int64),
                   'b': None if i % 100 == 0 else random.choice(range(100)),
                   'c': None if i % 10 == 0 else make_random_string(5)
               } for i in range(arr_len)]
           ))
   
   table = pa.table({'c1': nested_col})
   
   # table = pa.table({
   #     'c1': pa.array([list(range(random.randint(1, 20))) for _ in range(nrows)])
   # })
   
   # Writing to .parquet and loading it into arrow again
   pq.write_table(table, filename)
   ```
   
   Then measure:
   ```python
   import tracemalloc
   import pyarrow.parquet as pq
   
   filename = '/Users/willjones/Documents/arrows/arrow/python/nested_pandas.parquet'
   tracemalloc.start()
   table_from_parquet = pq.read_table(filename)
   
   out = table_from_parquet.to_pandas()
   
   print(tracemalloc.get_traced_memory())
   ```
   </details>
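
For repeating the measurement at several row counts, the `tracemalloc` calls can be wrapped in a small helper. This is a minimal sketch using only the standard library; `measure` is a hypothetical name, not part of the original script. Note that `tracemalloc.get_traced_memory()` returns a `(current, peak)` tuple in bytes, and that it only sees allocations made through Python's memory allocators.

```python
import tracemalloc


def measure(fn):
    """Run fn() under tracemalloc and return (result, current_bytes, peak_bytes).

    Hypothetical helper for illustration; not part of the original test script.
    """
    tracemalloc.start()
    try:
        result = fn()
        # get_traced_memory() reports (current, peak) traced sizes in bytes.
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current, peak


# Stand-in workload; in the test above this would be the read_table / to_pandas step.
result, current, peak = measure(lambda: [0] * 1_000_000)
print(f"current={current:,} bytes, peak={peak:,} bytes")
```

Returning both values makes it explicit whether a reported figure is the current or the peak traced size.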


