[jira] [Commented] (ARROW-17740) [c++][compute]Is there any other way to use Join besides Acero？

Aldrin Montana (Jira) Fri, 16 Sep 2022 06:31:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605827#comment-17605827
 ]


Aldrin Montana commented on ARROW-17740:
----------------------------------------

Is test.cpp just a snippet?

 
{code:cpp}
      ...
      // auto arrow_status = tr.ReadNext(&batch);
      auto record_batchs_result = tr.ToRecordBatches();
      if (!record_batchs_result.ok()) {
        res = arrow::Status::Invalid("response table ToRecordBatches failed: " +
                                     record_batchs_result.status().ToString());
        break;
      }
      if (background) {
        next_record_batches_ = record_batchs_result.ValueOrDie();

      } else {
        record_batches_ = record_batchs_result.ValueOrDie();
      }
    } while (0);
    ...
  }
{code}

There is one instance of `record_batches_`, and there's a member variable 
`record_batchs_`. It's probably an unrelated error, but it still seems suspicious.

I also cannot view the project commit. Is it in a private fork?

> [c++][compute]Is there any other way to use Join besides Acero？
> ---------------------------------------------------------------
>
>                 Key: ARROW-17740
>                 URL: https://issues.apache.org/jira/browse/ARROW-17740
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: LinGeLin
>            Priority: Major
>         Attachments: data.zip, test.cpp, v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>  
> In the scenario I'm working on, I'll read one Parquet file and then several 
> other Parquet files. These files will have the same column name (UUID). I 
> need to join (by UUID), project (remove UUID), and filter (some custom 
> filtering) the results of the two reads. As far as I can tell, Acero is the 
> only way to do a join, but in my tests its performance was very poor and it 
> was very unstable, with frequent core dumps. Is there another way to do 
> this, or at least another way to do a join?
>  
> my project commit: 
> [link|https://github.com/tensorflow/io/commit/57f373b352ea0181d65e12ac834ed9b2a3b31ef5a]
>  
> tensorflow ==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s  --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... 
> //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 
> --copt=-mfma --copt=-mavx2 
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl 
> --force-reinstall --no-deps
>  
> run v4test.py to test the dataset
>  
> Data.zip contains several parquet files, which are stored on S3 in my 
> scenario.
> I have copied some of the code into test.cpp; it only shows the general 
> flow and has not been compiled.
>  
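
For reference, the join (by UUID), project (remove UUID), and filter steps described in the issue can be sketched as a plain in-memory hash join. This is only an illustration of the semantics under stated assumptions: it uses the C++ standard library rather than any Arrow or Acero API, and the row types (LeftRow, RightRow, JoinedRow), the column names (label, feature), and the function JoinProjectFilter are all hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row layouts; the real Parquet schemas are not shown in the issue.
struct LeftRow {
  std::string uuid;
  int64_t label;
};

struct RightRow {
  std::string uuid;
  double feature;
};

// Output row after projection: the uuid join key is dropped.
struct JoinedRow {
  int64_t label;
  double feature;
};

// Inner hash join on uuid, followed by projection and a caller-supplied filter.
std::vector<JoinedRow> JoinProjectFilter(
    const std::vector<LeftRow>& left, const std::vector<RightRow>& right,
    const std::function<bool(const JoinedRow&)>& keep) {
  // Build phase: index the left rows by uuid (duplicate keys are allowed).
  std::unordered_multimap<std::string, const LeftRow*> index;
  index.reserve(left.size());
  for (const auto& row : left) {
    index.emplace(row.uuid, &row);
  }

  // Probe phase: look up each right row, project away uuid, then filter.
  std::vector<JoinedRow> out;
  for (const auto& r : right) {
    auto range = index.equal_range(r.uuid);
    for (auto it = range.first; it != range.second; ++it) {
      JoinedRow joined{it->second->label, r.feature};
      if (keep(joined)) {
        out.push_back(joined);
      }
    }
  }
  return out;
}
```

Calling JoinProjectFilter with a predicate such as "feature > 1.0" returns only the matched rows that pass that predicate, in the order the right-hand rows are probed. A sketch like this ignores everything Acero handles for columnar data (batching, spilling, parallelism), so it is meant for checking expected results on small inputs, not as a replacement.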



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
