jonkeane commented on issue #43742:
URL: https://github.com/apache/arrow/issues/43742#issuecomment-2295417454

   Thanks for this issue. 
   
   I took a look at the reprex and I can indeed trigger the same thing. What's happening is that the index itself is very large (like you note, 400MB), and when we convert that into an ASCII serialization (which we do for all metadata objects), it gets even larger (0.8GB).
   
   Additionally, here's the size of the overall attributes of the data.table, though this is actually both `row.names` and the index combined. [We by default remove](https://github.com/apache/arrow/blob/1ae38d0d42c1ae5800e42b613f22593673b7370c/r/R/metadata.R#L221-L239) the `row.names`, so those won't end up contributing to the size of the metadata:
   
   ```
   > print(object.size(attributes(dt)), units = "Gb")
   0.7 Gb
   ```
   
   The index alone is 400MB:
   
   ```
   > print(object.size(attr(dt, "index")), units = "Gb")
   0.4 Gb
   ```
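
   For anyone who wants to poke at this without the original data, something along these lines reproduces the shape of the problem (this reprex is my own sketch, not the reporter's; the column name is made up, and 1e8 integers at 4 bytes each is ~0.4GB):
   
   ```
   library(data.table)
   
   # ~1e8 rows; setindex() stores the sort order of the column as part of the
   # "index" attribute, so the index grows linearly with the number of rows
   dt <- data.table(x = sample(1e8))
   setindex(dt, x)
   
   # prints roughly 0.4 Gb: one integer of ordering information per row
   print(object.size(attr(dt, "index")), units = "Gb")
   ```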
   
   And when we serialize the attributes, we end up with an object that's ~0.7GB. This was odd to me at first: how is the metadata larger after serialization than it was before? It's because we make an ASCII representation of what were integers. We do also compress this, but since this particular metadata is close to worst case for compression, it doesn't actually help much here.
   
   ```
   > print(object.size(arrow:::.serialize_arrow_r_metadata(attributes(dt))), units = "Gb")
   0.7 Gb
   ```
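
   To see the inflation in isolation, here's a minimal sketch using base R's `serialize()` directly (arrow's helper differs in its exact details, but the mechanism is the same):
   
   ```
   idx <- sample.int(1e7)                        # a long integer vector, like an index
   binary <- serialize(idx, NULL)                # binary: 4 bytes per integer plus a header
   ascii  <- serialize(idx, NULL, ascii = TRUE)  # ASCII: each integer becomes its decimal digits
   length(binary) / 1e6                          # ~40 MB
   length(ascii) / 1e6                           # roughly double that
   length(memCompress(ascii, "gzip")) / 1e6      # compression claws some back, but random
                                                 # digits are close to worst case for it
   ```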
   
   We have (at least) two things we could do. We should probably do the first one for sure; the second one is harder:
   
   1. Warn about / prevent writing metadata that can't be read back in. We (can) know what the Thrift buffer restriction is, so we should not let arrow write out metadata that can't later be read back in (a rough sketch of such a check follows this list).
   2. Could we do something smarter about serializing long vectors of integers? In principle, we should be able to get this metadata down to at least its in-memory size, but off the top of my head I don't have a great idea of how to do that here.
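
   For the first option, a guard could be as simple as measuring the serialized string before writing. Everything below (the helper name, the limit constant, and dropping with a warning rather than erroring) is hypothetical, just to make the idea concrete:
   
   ```
   # Hypothetical sketch, not arrow's actual code: refuse to write R metadata
   # that would exceed what the reader's Thrift buffer limit can handle.
   METADATA_LIMIT_BYTES <- 100 * 1024^2  # placeholder; the real value should come from Thrift
   
   maybe_write_r_metadata <- function(serialized_metadata) {
     if (nchar(serialized_metadata, type = "bytes") > METADATA_LIMIT_BYTES) {
       warning("R metadata is too large to be read back reliably; dropping it. ",
               "Consider removing large attributes (e.g. a data.table index) before writing.")
       return(NULL)
     }
     serialized_metadata
   }
   ```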
   
   From a functional perspective: how bad would it be to remove the index when it is too large? Would that disrupt people's workflows with data.table?
   

