[
https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607435#comment-17607435
]
Weston Pace commented on ARROW-17740:
-------------------------------------
Thanks for uploading the new example. I was able to get it to compile and play
around with it. Only one thing jumped out at me that could cause a
segmentation fault. When you run the plan:
{noformat}
// validate the ExecPlan
plan->Validate();
// start the ExecPlan
plan->StartProducing();
std::shared_ptr<arrow::Table> response_table;
response_table =
arrow::Table::FromRecordBatchReader(sink_reader.get()).ValueOrDie();
std::cout << "Result: " << response_table->ToString() << std::endl;
plan->StopProducing();
plan->finished();
{noformat}
The call to {{plan->finished()}} actually returns a {{Future<>}} object. You
are not waiting for this future to complete. The {{ExecPlan}} cannot be
safely destroyed until this future has completed. If it were being destroyed
early, I would expect an error logged to stderr:
{noformat}
Plan was destroyed before finishing
{noformat}
Instead you probably want...
{noformat}
plan->finished().result().ValueOrDie();
{noformat}
There are also a few calls, such as {{plan->StartProducing()}} and
{{plan->Validate()}}, that return {{arrow::Status}}. You are not inspecting
those results, so you might miss errors.
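Putting those fixes together, the tail of the run with every status checked might look roughly like this. This is only a sketch: it assumes Arrow ~9.x header paths and the same {{plan}} and {{sink_reader}} variables from your snippet.

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>                     // arrow::Status, arrow::Table
#include <arrow/compute/exec/exec_plan.h>  // arrow::compute::ExecPlan (Arrow ~9.x)

// Run an already-constructed plan, checking every Status instead of
// discarding it, and wait for the plan to finish before returning.
arrow::Status RunPlan(
    const std::shared_ptr<arrow::compute::ExecPlan>& plan,
    const std::shared_ptr<arrow::RecordBatchReader>& sink_reader) {
  ARROW_RETURN_NOT_OK(plan->Validate());
  ARROW_RETURN_NOT_OK(plan->StartProducing());

  // Drain the sink into a table.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> response_table,
                        arrow::Table::FromRecordBatchReader(sink_reader.get()));
  std::cout << "Result: " << response_table->ToString() << std::endl;

  plan->StopProducing();
  // Block until the plan has actually finished; destroying the ExecPlan
  // before this future completes is unsafe.
  return plan->finished().status();
}
```

Returning {{plan->finished().status()}} blocks until the plan is done, so the {{ExecPlan}} can no longer be destroyed mid-flight.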
Regarding performance: you are currently reading each file into a table and
then using a {{table_source}} node to read from that in-memory table. Acero
can work this way, but it really shines when your sources are files. This is
what the "scan" node is designed for. The scan node has some advantages:
* The join will start processing while the I/O is happening (e.g. we can join
on batch N while reading batch N-1 from disk)
* The scan node will read in parallel and optimize I/O for full-file reading
with pre-buffering, etc.
I'm uploading a new version of the test_join.cpp program which demonstrates
using a scanner. [^test_join.cpp]
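For context before you open the attachment, a scan source can be declared roughly like this. This is a sketch rather than the attached program: it assumes Arrow ~9.x header paths, and {{fs}}/{{paths}} stand in for your S3 filesystem and Parquet file paths.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include <arrow/compute/exec/exec_plan.h>  // arrow::compute::Declaration
#include <arrow/dataset/discovery.h>       // FileSystemDatasetFactory
#include <arrow/dataset/file_parquet.h>    // ParquetFileFormat
#include <arrow/dataset/scanner.h>         // ScanOptions, ScanNodeOptions
#include <arrow/filesystem/filesystem.h>   // arrow::fs::FileSystem
#include <arrow/result.h>

// Build a "scan" source declaration over Parquet files instead of a
// table_source over an in-memory table.  `fs` and `paths` are placeholders
// for your S3 filesystem and file paths.
arrow::Result<arrow::compute::Declaration> MakeScanSource(
    std::shared_ptr<arrow::fs::FileSystem> fs,
    std::vector<std::string> paths) {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          std::move(fs), std::move(paths), format,
          arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  auto scan_options = std::make_shared<arrow::dataset::ScanOptions>();
  scan_options->use_threads = true;  // read files in parallel

  // The scan node streams batches into the plan as they are read from disk,
  // so the join can overlap with I/O.
  return arrow::compute::Declaration{
      "scan", arrow::dataset::ScanNodeOptions{std::move(dataset),
                                              std::move(scan_options)}};
}
```

The resulting declaration can be used wherever the {{table_source}} declaration was, e.g. as one input of the hash-join declaration.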
> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
> Key: ARROW-17740
> URL: https://issues.apache.org/jira/browse/ARROW-17740
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: LinGeLin
> Priority: Major
> Attachments: data.zip, join_test.zip, test.cpp, test_join.cpp,
> v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>
> In the scenario I'm working on, I'll read one Parquet file and then several
> other Parquet files. These files will have the same column name (UUID). I
> need to join (by UUID), project (remove UUID), and filter (some custom
> filtering) the results of the two reads. I found that only Acero could be
> used to do the join, but when I tested it, Acero's performance was very poor
> and very unstable, and core dumps happened often. Is there another way? Or
> just another way to do a join?
>
> my project commit:
> [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow ==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/...
> //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2
> --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl
> --force-reinstall --no-deps
>
> run v4test.py to test the dataset
>
> data.zip contains several Parquet files, which are stored on S3 in my
> scenario.
> I have copied some of the code into test.cpp; it shows only the general
> flow and has not been compiled.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)