[
https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607435#comment-17607435
]
Weston Pace commented on ARROW-17740:
-------------------------------------
Thanks for uploading the new example. I was able to get it to compile and play
around with it. Only one thing jumped out at me that could cause a
segmentation fault. When you run the plan:
{noformat}
// validate the ExecPlan
plan->Validate();
// start the ExecPlan
plan->StartProducing();
std::shared_ptr<arrow::Table> response_table;
response_table =
arrow::Table::FromRecordBatchReader(sink_reader.get()).ValueOrDie();
std::cout << "Result: " << response_table->ToString() << std::endl;
plan->StopProducing();
plan->finished();
{noformat}
The call to {{plan->finished()}} actually returns a {{Future<>}} object. You
are not waiting for this future to complete. The {{ExecPlan}} cannot be
safely destroyed until this future has completed. If it were being destroyed
early, I would expect an error logged to stderr:
{noformat}
Plan was destroyed before finishing
{noformat}
Instead you probably want...
{noformat}
plan->finished().result().ValueOrDie();
{noformat}
There are also a few calls, such as {{plan->StartProducing()}} and
{{plan->Validate()}}, that return {{arrow::Status}}. You are not inspecting
those results, so you might miss errors.
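Putting those fixes together, the tail of the run with every status checked might look roughly like this. This is only a sketch: it assumes Arrow ~9.x header paths and the same {{plan}} and {{sink_reader}} variables from your snippet.

```cpp
#include <iostream>
#include <memory>

#include <arrow/api.h>                     // arrow::Status, arrow::Table
#include <arrow/compute/exec/exec_plan.h>  // arrow::compute::ExecPlan (Arrow ~9.x)

// Run an already-constructed plan, checking every Status instead of
// discarding it, and wait for the plan to finish before returning.
arrow::Status RunPlan(
    const std::shared_ptr<arrow::compute::ExecPlan>& plan,
    const std::shared_ptr<arrow::RecordBatchReader>& sink_reader) {
  ARROW_RETURN_NOT_OK(plan->Validate());
  ARROW_RETURN_NOT_OK(plan->StartProducing());

  // Drain the sink into a table.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> response_table,
                        arrow::Table::FromRecordBatchReader(sink_reader.get()));
  std::cout << "Result: " << response_table->ToString() << std::endl;

  plan->StopProducing();
  // Block until the plan has actually finished; destroying the ExecPlan
  // before this future completes is unsafe.
  return plan->finished().status();
}
```

Returning {{plan->finished().status()}} blocks until the plan is done, so the {{ExecPlan}} can no longer be destroyed mid-flight.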
Regarding performance: you are currently reading each file into a table and
then using a {{table_source}} node to read from that in-memory table. Acero
can work this way, but it really shines when your sources are files. This is
what the "scan" node is designed for. The scan node has some advantages:
* The join will start processing while the I/O is happening (e.g. we can join
on batch N while reading batch N-1 from disk)
* The scan node will read in parallel and optimize I/O for full-file reading
with pre-buffering, etc.
I'm uploading a new version of the test_join.cpp program which demonstrates
using a scanner. [^test_join.cpp]
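For context before you open the attachment, a scan source can be declared roughly like this. This is a sketch rather than the attached program: it assumes Arrow ~9.x header paths, and {{fs}}/{{paths}} stand in for your S3 filesystem and Parquet file paths.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

#include <arrow/compute/exec/exec_plan.h>  // arrow::compute::Declaration
#include <arrow/dataset/discovery.h>       // FileSystemDatasetFactory
#include <arrow/dataset/file_parquet.h>    // ParquetFileFormat
#include <arrow/dataset/scanner.h>         // ScanOptions, ScanNodeOptions
#include <arrow/filesystem/filesystem.h>   // arrow::fs::FileSystem
#include <arrow/result.h>

// Build a "scan" source declaration over Parquet files instead of a
// table_source over an in-memory table.  `fs` and `paths` are placeholders
// for your S3 filesystem and file paths.
arrow::Result<arrow::compute::Declaration> MakeScanSource(
    std::shared_ptr<arrow::fs::FileSystem> fs,
    std::vector<std::string> paths) {
  auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      arrow::dataset::FileSystemDatasetFactory::Make(
          std::move(fs), std::move(paths), format,
          arrow::dataset::FileSystemFactoryOptions{}));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  auto scan_options = std::make_shared<arrow::dataset::ScanOptions>();
  scan_options->use_threads = true;  // read files in parallel

  // The scan node streams batches into the plan as they are read from disk,
  // so the join can overlap with I/O.
  return arrow::compute::Declaration{
      "scan", arrow::dataset::ScanNodeOptions{std::move(dataset),
                                              std::move(scan_options)}};
}
```

The resulting declaration can be used wherever the {{table_source}} declaration was, e.g. as one input of the hash-join declaration.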
> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
> Key: ARROW-17740
> URL: https://issues.apache.org/jira/browse/ARROW-17740
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: LinGeLin
> Priority: Major
> Attachments: data.zip, join_test.zip, test.cpp, test_join.cpp,
> v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>
> In the scenario I'm working on, I'll read one Parquet file and then several
> other Parquet files. These files will have the same column name (UUID). I
> need to join (by UUID), project (remove UUID), and filter (some custom
> filtering) the results of the two reads. I found that only Acero could be
> used to do the join, but when I tested it, Acero's performance was very poor
> and very unstable, and core dumps happened often. Is there another way? Or
> just another way to do a join?
>
> my project commit:
> [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow ==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/...
> //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2
> --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl
> --force-reinstall --no-deps
>
> run v4test.py to test the dataset
>
> data.zip contains several Parquet files, which are stored on S3 in my
> scenario.
> I have copied some of the code into test.cpp; it shows only the general
> flow and has not been compiled.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)