jonkeane commented on issue #43742: URL: https://github.com/apache/arrow/issues/43742#issuecomment-2295417454
Thanks for this issue. I took a look at the reprex and I can indeed trigger the same thing. What's happening is that the index itself is very large (like you note, 400MB), and when we convert that into an ASCII serialization (which we do for all metadata objects), it gets even larger (0.8GB).

The overall attributes of the data.table come to 0.7GB, though that is actually both `row.names` and the index combined. [We by default remove](https://github.com/apache/arrow/blob/1ae38d0d42c1ae5800e42b613f22593673b7370c/r/R/metadata.R#L221-L239) the `row.names`, so those won't end up contributing to the size of the metadata.

```
> print(object.size(attributes(dt)), units = "Gb")
0.7 Gb
```

The index alone is 400MB:

```
> print(object.size(attr(dt, "index")), units = "Gb")
0.4 Gb
```

And when we serialize the attributes, we end up with an object that's ~0.7GB. This was odd to me at first: how is the metadata larger after serialization than it was before? It's because we make an ASCII representation of what were integers (illustrated in the first sketch below). We do also compress this, but since this particular metadata is close to worst case for compression, the compression doesn't actually help too much here.

```
> print(object.size(arrow:::.serialize_arrow_r_metadata(attributes(dt))), units = "Gb")
0.7 Gb
```

We have (at least) two things we could do. We probably should do the first one for sure; the second is harder:

1. Warn about / prevent writing metadata that can't be read back in. We (can) know what the thrift buffer restriction is, so we should not let arrow write out metadata that can't later be read back in (a rough sketch of such a guard is below).
2. Could we do something smarter about serializing long vectors of integers? In principle, we should be able to get this metadata down to at least what it is in memory, but I don't yet have a great idea of how to do that in this case off the top of my head (one possible direction is sketched below).

From a functional perspective: how bad would it be to remove the index when it is too large? Would that disrupt people's workflows with data.table?
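To make the ASCII blow-up concrete, here is a minimal standalone sketch. It is not arrow's exact code path, just base R's `serialize()` on a shuffled index of roughly the size from the reprex:

```
# Minimal sketch (not arrow's exact code path): ASCII serialization
# spells each 4-byte integer out as decimal digits plus a separator,
# roughly doubling the size of a long integer vector.
x <- sample.int(1e7)  # a shuffled 10-million-element index

print(object.size(x), units = "Mb")
# ~38 Mb in memory

ascii <- serialize(x, connection = NULL, ascii = TRUE)
print(object.size(ascii), units = "Mb")
# roughly twice that once every integer is written out as text
```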
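For option 1, a rough sketch of what the guard could look like. The function name and the default limit here are placeholders I made up, not arrow's actual API; the real limit would be whatever the thrift buffer restriction works out to:

```
# Hypothetical guard, not arrow's actual implementation: check the
# serialized R metadata against a size limit before embedding it, and
# drop it with a warning instead of writing an unreadable file.
check_r_metadata_size <- function(serialized, limit = 2^31 - 1) {
  size <- nchar(serialized, type = "bytes")  # `limit` is a placeholder value
  if (size > limit) {
    warning(
      "Serialized R metadata (", size, " bytes) is too large to be read ",
      "back in; dropping it. Consider removing large attributes ",
      "(e.g. a data.table index) before writing."
    )
    return(NULL)
  }
  serialized
}
```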
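For option 2, one possible direction (purely illustrative): serialize long integer vectors in binary rather than ASCII before compressing. Embedding the result in the metadata would still need a byte-safe encoding such as base64, so this is only a sketch of the size comparison:

```
# Illustrative size comparison for option 2: binary serialization keeps
# a long integer vector near its in-memory size, while ASCII roughly
# doubles it, and compression helps the binary form more.
x <- sample.int(1e7)  # shuffled, so roughly worst case for compression

ascii_gz  <- memCompress(serialize(x, NULL, ascii = TRUE), type = "gzip")
binary_gz <- memCompress(serialize(x, NULL, ascii = FALSE), type = "gzip")

print(object.size(ascii_gz), units = "Mb")
print(object.size(binary_gz), units = "Mb")
# the binary form comes out smaller, and close to the in-memory size
```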
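Partly answering my own question at the end: as far as I can tell, users who hit this today can drop the index on the data.table side before writing, since data.table rebuilds indices lazily. A sketch, assuming dropping the index is acceptable for the workflow:

```
# Possible workaround today: remove data.table's secondary indices
# before writing; data.table will rebuild them lazily when needed.
library(data.table)
setindex(dt, NULL)  # removes all indices from dt by reference
arrow::write_feather(dt, "dt.arrow")  # or whichever write_* the reprex uses
```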
