[
https://issues.apache.org/jira/browse/ARROW-7522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17013588#comment-17013588
]
Philipp Moritz commented on ARROW-7522:
---------------------------------------
Disconnecting from the PlasmaClient and letting it go out of scope is fine, the
memory mapped files will still be kept alive. The problem in this code example
is that the buffer of the object is not kept alive (the buffer in the `auto
buffer = object_buffer.data;` line). If that buffer is kept alive, this shared
pointer here
[https://github.com/apache/arrow/blob/b218a7fdae0792e185579d8cd20748ed0752b9ff/cpp/src/plasma/client.cc#L137]
will make sure the PlasmaClient is kept alive, which will make sure the memory
maps are kept alive.
To fix this, we would need some way to set a "base" object of an
`arrow::RecordBatch` (similar to numpy base objects) which will make sure the
backing buffer is kept alive. As a workaround you can also keep the
PlasmaClient alive, but that feel very brittle.
> [C++][Plasma] Broken Record Batch returned from a function call
> ---------------------------------------------------------------
>
> Key: ARROW-7522
> URL: https://issues.apache.org/jira/browse/ARROW-7522
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, C++ - Plasma
> Affects Versions: 0.15.1
> Environment: macOS
> Reporter: Chengxin Ma
> Priority: Minor
>
> Scenario: retrieving Record Batch from Plasma with known Object ID.
> The following code snippet works well:
> {code:java}
> int main(int argc, char **argv)
> {
> plasma::ObjectID object_id =
> plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");
> // Start up and connect a Plasma client.
> plasma::PlasmaClient client;
> ARROW_CHECK_OK(client.Connect("/tmp/store"));
> plasma::ObjectBuffer object_buffer;
> ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));
> // Retrieve object data.
> auto buffer = object_buffer.data;
> arrow::io::BufferReader buffer_reader(buffer);
> std::shared_ptr<arrow::ipc::RecordBatchReader> record_batch_stream_reader;
> ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(&buffer_reader,
> &record_batch_stream_reader));
> std::shared_ptr<arrow::RecordBatch> record_batch;
> arrow::Status status =
> record_batch_stream_reader->ReadNext(&record_batch);
> std::cout << "record_batch->column_name(0): " <<
> record_batch->column_name(0) << std::endl;
> std::cout << "record_batch->num_columns(): " <<
> record_batch->num_columns() << std::endl;
> std::cout << "record_batch->num_rows(): " << record_batch->num_rows() <<
> std::endl;
> std::cout << "record_batch->column(0)->length(): "
> << record_batch->column(0)->length() << std::endl;
> std::cout << "record_batch->column(0)->ToString(): "
> << record_batch->column(0)->ToString() << std::endl;
> }
> {code}
> {{record_batch->column(0)->ToString()}} would incur a segmentation fault if
> retrieving Record Batch is wrapped in a function:
> {code:java}
> std::shared_ptr<arrow::RecordBatch> GetRecordBatchFromPlasma(plasma::ObjectID
> object_id)
> {
> // Start up and connect a Plasma client.
> plasma::PlasmaClient client;
> ARROW_CHECK_OK(client.Connect("/tmp/store"));
> plasma::ObjectBuffer object_buffer;
> ARROW_CHECK_OK(client.Get(&object_id, 1, -1, &object_buffer));
> // Retrieve object data.
> auto buffer = object_buffer.data;
> arrow::io::BufferReader buffer_reader(buffer);
> std::shared_ptr<arrow::ipc::RecordBatchReader> record_batch_stream_reader;
> ARROW_CHECK_OK(arrow::ipc::RecordBatchStreamReader::Open(&buffer_reader,
> &record_batch_stream_reader));
> std::shared_ptr<arrow::RecordBatch> record_batch;
> arrow::Status status =
> record_batch_stream_reader->ReadNext(&record_batch);
> // Disconnect the client.
> ARROW_CHECK_OK(client.Disconnect());
> return record_batch;
> }
> int main(int argc, char **argv)
> {
> plasma::ObjectID object_id =
> plasma::ObjectID::from_binary("0FF1CE00C0FFEE00BEEF");
> std::shared_ptr<arrow::RecordBatch> record_batch =
> GetRecordBatchFromPlasma(object_id);
> std::cout << "record_batch->column_name(0): " <<
> record_batch->column_name(0) << std::endl;
> std::cout << "record_batch->num_columns(): " <<
> record_batch->num_columns() << std::endl;
> std::cout << "record_batch->num_rows(): " << record_batch->num_rows() <<
> std::endl;
> std::cout << "record_batch->column(0)->length(): "
> << record_batch->column(0)->length() << std::endl;
> std::cout << "record_batch->column(0)->ToString(): "
> << record_batch->column(0)->ToString() << std::endl;
> }
> {code}
> The meta info of the Record Batch such as number of columns and rows is still
> available, but I can't see the content of the columns.
> {{lldb}} says that the stop reason is {{EXC_BAD_ACCESS}}, so I think the
> Record Batch is destroyed after {{GetRecordBatchFromPlasma}} finishes. But
> why can I still see the meta info of this Record Batch?
> What is the proper way to get the Record Batch if we insist using a function?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)