leprechaunt33 commented on issue #33049:
URL: https://github.com/apache/arrow/issues/33049#issuecomment-1459950253

I managed to get to this a bit earlier, and it definitely looks like a take issue. I printed out the indices and the sizes of the data structures for each call to pyarrow take, and I see the following for the column in question (the fields are sys.getsizeof of the data, len(indices), and the indices themselves):
2175981410 16 [0, 566984, 568042, 987100, 1021224, 1082097, 1097740, 1499272, 1505009, 1537374, 1598404, 1749420, 1818868, 1890281, 1890379, 1893484]
1444605469 10 [0, 566984, 568042, 987100, 1021224, 1082097, 1097740, 1499272, 1505009, 1537374]
600036404 6 [0, 151016, 220464, 291877, 291975, 295080]
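
For context, a printout like the one above can be produced with something along these lines. This is only a rough sketch: the file name ("data.arrow"), the column name ("text"), and the use of ChunkedArray.nbytes in place of sys.getsizeof are my assumptions, not the original script.

```python
import pyarrow as pa

# Rough sketch (not the original script): before each take() on the problem
# column, print the total buffer size of the source data, the number of
# indices, and the indices themselves.
source = pa.memory_map("data.arrow", "r")   # assumed memory-mapped IPC file
table = pa.ipc.open_file(source).read_all()
column = table.column("text")               # assumed column name

def logged_take(col: pa.ChunkedArray, indices):
    # col.nbytes reports the Arrow buffer size; the figures above were
    # taken with sys.getsizeof instead.
    print(col.nbytes, len(indices), list(indices))
    return col.take(indices)
```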
   
The first of the three printouts is the column without using an iterator; the second and third are from using an iterator with a chunk size (number of rows) of 10. If I'm reading this right, the problem is that the data structure being taken from has hit the limit (the first size is about 2.18 GB, just over the 2 GiB signed 32-bit boundary), not the size of the data being collated. I also ran a memory profile on the same code and found that the iterator and to_pandas_df both cause allocation of almost identical amounts of memory (around 165 MB), so this is hardly a memory issue.
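
A minimal sketch of that kind of comparison, assuming memory_profiler is available; the file name and helper names are made up for illustration:

```python
import pyarrow as pa
from memory_profiler import memory_usage

# Rough sketch: compare peak memory of a full to_pandas() conversion with
# iterating the table in 10-row batches.
source = pa.memory_map("data.arrow", "r")   # assumed file name
table = pa.ipc.open_file(source).read_all()

def full_to_pandas():
    table.to_pandas()

def iterate_batches(chunk_size=10):         # 10-row chunks, as above
    for batch in table.to_batches(max_chunksize=chunk_size):
        batch.to_pandas()

print("to_pandas peak MiB:", max(memory_usage((full_to_pandas, (), {}))))
print("iterator  peak MiB:", max(memory_usage((iterate_batches, (), {}))))
```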
   
So I guess the next question is whether slice() fixes the problem of indexing these large memory-mapped arrays. Will watch with interest :)
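
For illustration, a slice-based loop might look roughly like this (a sketch only; file name and chunk size are assumptions):

```python
import pyarrow as pa

# Rough sketch: read contiguous row ranges with zero-copy slice() rather than
# gathering rows through take() and an index array.
source = pa.memory_map("data.arrow", "r")   # assumed file name
table = pa.ipc.open_file(source).read_all()

chunk_size = 10                             # chunk size used above
for start in range(0, table.num_rows, chunk_size):
    chunk = table.slice(start, chunk_size)  # zero-copy view of the mapped data
    df = chunk.to_pandas()
```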

