leprechaunt33 commented on issue #33049:
URL: https://github.com/apache/arrow/issues/33049#issuecomment-1463522975

   @maartenbreddels It seems from my testing that this may be a third issue, also related to take, which only occurs when vaex is forced to do a df.take on rows that contain a string column whose unfiltered in-memory representation is larger than 2GB. For example, I have been able to consistently reproduce the bug with the leiomyosarcoma data set (the 16 indices above) but not with juvenile polymyositis, which generates indices at the start of the data set (indices 0, 44726, 225143), so the full 2GB of data is never memory mapped. Wherever I have been able to reproduce the problem consistently, vaex has been attempting a take (df.take in this case, which turns into a column take) on high row indices, which causes vaex to read in more than 2GB of data; the exception occurs precisely when the lazy execution attempts to take from that large memory-mapped column.
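   To make the 2GB boundary concrete, here is a minimal stand-alone sketch at the pyarrow level (no vaex). The chunk sizes and indices are illustrative only, and the exact failure point depends on the pyarrow version, but on affected versions taking a high index from a chunked `string` column (32-bit offsets) whose total data exceeds 2GB raises ArrowInvalid:

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   chunk = pa.array(["x" * 1024] * (256 * 1024))   # ~256MB of string data per chunk
   col = pa.chunked_array([chunk] * 9)             # ~2.25GB total, plain string type

   try:
       # On affected pyarrow versions, take on the chunked column flattens the
       # chunks first, and the combined string data no longer fits 32-bit offsets.
       pc.take(col, pa.array([len(col) - 1]))
   except pa.ArrowInvalid as exc:
       print("ArrowInvalid:", exc)   # e.g. "offset overflow while concatenating arrays"
   ```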
   
   I have not noted a speed issue or memory explosion in these cases (testing on the 16-index case above allocated only 165MB), but the ArrowInvalid is consistently raised. It should be fairly simple to test this by generating an HDF5 data set with arbitrary-length string columns and doing a df.take on high indices, or on a small number of indices that span the column. The error is then raised when attempting to access the data in that column (for example, when converting to pandas); a rough sketch of such a reproduction follows.
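
   A rough sketch of that reproduction, assuming vaex's standard API (vaex.from_arrays, export_hdf5, open, take, to_pandas_df); the column size and file name are illustrative only, and this writes a roughly 3GB HDF5 file:

   ```python
   import numpy as np
   import vaex

   n_rows = 3_000_000
   # ~1KB of string data per row -> roughly 3GB for the column, past the 2GB
   # limit of 32-bit string offsets.
   strings = np.array(["x" * 1024] * n_rows, dtype=object)
   vaex.from_arrays(s=strings).export_hdf5("big_strings.hdf5")

   df = vaex.open("big_strings.hdf5")              # memory-mapped
   taken = df.take([0, n_rows // 2, n_rows - 1])   # indices spanning the column
   # Forcing materialisation of the string column is where, on affected
   # versions, the ArrowInvalid shows up.
   taken.to_pandas_df()
   ```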

