jonkeane commented on issue #45300:
URL: https://github.com/apache/arrow/issues/45300#issuecomment-2629517478

   Circling back here: there's lots of detailed discussion in #45346, but the 
TL;DR is that the warning from `data.table` about `.internal.selfref` is 
working as designed. `.internal.selfref` is a pointer to memory, so it will not 
be valid (and is typically discarded) after any serialization (whether to 
Parquet with `arrow`, to RDS, etc.). The warning is saying that `data.table` is 
reconstructing that pointer. 
   
   > Our workflow is to work with large data.tables in a targets pipeline, 
saving interim files as parquet files, which targets does using 
arrow::read_parquet - this bug slows down our projects, as data.table creates 
shallow copies of our large data.tables.
   
   Given that there is no way to serialize a `data.table` with this pointer, 
targets + any serialization is going to cause similar performance issues, 
regardless of whether it's Parquet via `arrow` or anything else. Maybe the 
workflow can be adjusted to reduce the number of serializations that happen in 
targets to avoid this? That would be where I would start, at least. You might 
also be able to use `data.table::setDT()` when you read in from Parquet. IIUC 
that will prevent the warning and reconstruct the pointer as efficiently as 
possible.
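
   A minimal sketch of that `setDT()` suggestion, assuming a plain round trip 
through a temporary Parquet file. The path, column names, and values are made 
up for illustration, and I haven't checked this against a targets pipeline:

```r
library(arrow)
library(data.table)

# Hypothetical example data and a temporary file, for illustration only
path <- tempfile(fileext = ".parquet")
dt <- data.table(id = 1:3, value = c(0.5, 1.5, 2.5))
write_parquet(dt, path)

# The memory pointer in .internal.selfref cannot survive the round trip
restored <- read_parquet(path)

# setDT() converts to data.table by reference (if needed) and sets up a
# valid self-reference, so the := below can add the column by reference
setDT(restored)
restored[, doubled := value * 2]
```

   The idea is to call `setDT()` once, right after reading, so that later 
`:=` calls do not each have to detect and repair the stale pointer.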
   
   > Note that in arrow version 16.0 and earlier this issue did not occur, 
data.tables make the round trip through read/write_parquet successfully, 
maintaining the ability to set columns by reference.
   
   The version here is surprising, but ultimately a red herring. I'm not 100% 
sure why `data.table` didn't raise this warning before `arrow` version 16, but 
I am confident that it was not possible then to serialize the pointer in 
`.internal.selfref`, so the performance was likely very similar (even if there 
was no warning about this).
   
   I'm going to close this for now, but feel free to re-open if I've missed 
something.
   
   

