jonkeane commented on issue #45300: URL: https://github.com/apache/arrow/issues/45300#issuecomment-2629517478
Circling back here: there's lots of detailed discussion in #45346, but the TL;DR is that the warning from `data.table` about `.internal.selfref` is working as designed. `.internal.selfref` is a pointer to memory, so it will not be valid (and typically is discarded) after any serialization, whether to Parquet with `arrow`, to RDS, etc. The warning is saying that `data.table` is reconstructing that pointer.

> Our workflow is to work with large data.tables in a targets pipeline, saving interim files as parquet files, which targets does using arrow::read_parquet - this bug slows down our projects, as data.table creates shallow copies of our large data.tables.

Given that there is no way to serialize a `data.table` with this pointer, targets plus any serialization is going to cause similar performance issues, regardless of whether it's Parquet via `arrow` or anything else. Maybe the workflow can be adjusted to reduce the number of serializations that happen in targets to avoid this? That would be where I would start, at least. You might also be able to use `data.table::setDT()` when you read in from Parquet. IIUC that will prevent the warning and reconstruct the pointer as efficiently as possible.

> Note that in arrow version 16.0 and earlier this issue did not occur, data.tables make the round trip through read/write_parquet successfully, maintaining the ability to set columns by reference.

The version here is surprising, but ultimately a red herring. I'm not 100% sure why `data.table` didn't raise this warning before `arrow` version 16, but I am confident that it was not possible then to serialize the pointer in `.internal.selfref`, so the performance was likely very similar (even if there was no warning about this).

I'm going to close this for now, but feel free to re-open if I've missed something.
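
For illustration, the `data.table::setDT()` suggestion above might look roughly like the sketch below. The example data, the column names, and the temporary file path are all hypothetical, and this has not been checked against a targets pipeline; it only shows calling `setDT()` right after reading from Parquet and before any `:=` assignment.

```r
library(data.table)
library(arrow)

# Hypothetical example data and temporary file path, for illustration only.
dt <- data.table(id = 1:3, value = c(0.5, 1.5, 2.5))
path <- tempfile(fileext = ".parquet")

# Serializing to Parquet cannot carry the in-memory `.internal.selfref`
# pointer along with the data.
write_parquet(dt, path)

# Read the file back, then call setDT() before modifying anything.
# Assumption (per the comment above): this lets data.table rebuild its
# internal pointer up front instead of warning on the first `:=`.
restored <- read_parquet(path)
setDT(restored)

# Add a column by reference after setDT().
restored[, doubled := value * 2]
```

In a targets pipeline the `setDT()` call would presumably go at the top of each target that consumes a stored `data.table`, though where exactly it fits depends on how the pipeline reads its interim files.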
