idavi-bcs commented on issue #36589:
URL: https://github.com/apache/arrow/issues/36589#issuecomment-1781310090

   Unfortunately, this looks like a well-known bug by now, reported 
multiple times (#36589, #30302, #27616).  However, I want to point out an 
important impact that I haven't seen mentioned yet.  I have a large data table 
of int8 categoricals (genotype data) that fits comfortably in memory and can 
easily be written to Parquet.  But I cannot *read* the Parquet file back into 
memory, because reading takes 5 times as much space: the int32 values (4 bytes 
each) and the int8 values (1 byte each) are in memory simultaneously while 
Pandas casts back to int8.  So my data is effectively lost.
   
   In other words, this is not just a performance bug: for large tables it 
can actually cause data loss!
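
   To make the 5x figure concrete, here is a rough sketch of the arithmetic 
only.  It does not call pandas or pyarrow; the helper names and the example 
table size are hypothetical, and it assumes the peak is simply one int32 copy 
plus one int8 copy of every value, ignoring any other overhead.

```python
import ctypes

# Hypothetical helpers for illustration; they only estimate memory and are
# not part of pandas or pyarrow.
INT8_BYTES = ctypes.sizeof(ctypes.c_int8)    # 1 byte per value
INT32_BYTES = ctypes.sizeof(ctypes.c_int32)  # 4 bytes per value


def peak_bytes_during_downcast(n_values: int) -> int:
    """Bytes held while an int32 buffer and its int8 copy coexist."""
    return n_values * (INT32_BYTES + INT8_BYTES)


def peak_to_final_ratio(n_values: int) -> float:
    """Peak footprint divided by the footprint of the final int8 data."""
    return peak_bytes_during_downcast(n_values) / (n_values * INT8_BYTES)


# Assumed example size, chosen only for illustration: 10 billion values.
n = 10_000_000_000
print(peak_bytes_during_downcast(n) / 1e9, "GB at peak")
print(n * INT8_BYTES / 1e9, "GB once only the int8 data remains")
print(peak_to_final_ratio(n), "x")
```

   Under these assumptions, a table that occupies 10 GB as int8 would need 
about 50 GB at peak to be read back, which is why a table that fits in memory 
when written may not fit when read.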


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
