leprechaunt33 commented on issue #33049: URL: https://github.com/apache/arrow/issues/33049#issuecomment-1463522975
@maartenbreddels From my testing it seems this may be a third issue also related to take, which only occurs when vaex is forced to do a df.take on rows containing a string column whose unfiltered in-memory representation is larger than 2GB. For example, I have been able to consistently reproduce the bug with the leiomyosarcoma data set (the 16 indices above) but not with juvenile polymyositis, which generates indices at the start of the data set (indices 0, 44726, 225143), so the full 2GB of data is never memory mapped.

Wherever I have been able to consistently reproduce the problem, vaex has been attempting a take (df.take in this case, which turns into a column take) on high row indices, causing it to read in more than 2GB of data, and the exception occurs precisely after the lazy execution attempts to take from that large memory-mapped column. I have not noticed a speed issue or memory explosion in these cases (testing on the 16-index case above only allocated 165MB), but the ArrowInvalid is consistently raised.

It should be fairly simple to test this by generating an hdf5 data set with arbitrary-length string columns and doing a df.take on high indices, or on a small number of indices that span the column. The error is then raised when attempting to access the data in that column (for example when converting to pandas); a minimal sketch of such a reproduction is below.
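A rough, untested sketch of what such a reproduction could look like, assuming vaex's public `from_arrays`, `export_hdf5`, `open`, `take`, and `to_pandas_df` APIs; the row count, payload size, file name, and column names are illustrative assumptions, not values from the issue:

```python
import numpy as np
import vaex

# Build an hdf5 file whose single string column exceeds 2GB in memory.
# Assumed sizes: ~30M rows of a ~100-byte string, roughly 3GB of string data.
# Note this also needs several GB of RAM just to construct the source arrays.
n_rows = 30_000_000
payload = "x" * 100
df = vaex.from_arrays(
    id=np.arange(n_rows),
    text=np.array([payload] * n_rows, dtype=object),
)
df.export_hdf5("big_strings.hdf5")  # hypothetical file name

# Re-open the file memory mapped and take rows near the end of the column,
# forcing the take to span (and read past) the 2GB mark of the string column.
df = vaex.open("big_strings.hdf5")
subset = df.take([n_rows - 3, n_rows - 2, n_rows - 1])

# Materializing the string data (e.g. converting to pandas) is the point
# where the ArrowInvalid described above would be expected to surface.
print(subset.to_pandas_df())
```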
