[ https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608035#comment-17608035 ]

LinGeLin commented on ARROW-17740:
----------------------------------

Yes, a release build. I also apply a Project during the Scan:
{code:cpp}
// when reading the left file, column_names = {"uuid", "s1"}
// when reading the right file, column_names = {"uuid", "b1"}
scanner_build->Project(column_names);{code}
Could having too many columns be affecting performance? Each of my new test
data files has a thousand columns.
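For reference, the scan-project-join flow discussed in this thread can be sketched with Acero's Declaration API. This is a hedged sketch, not the reporter's code: it assumes a recent Arrow C++ release (where Acero lives in `arrow::acero`), `JoinOnUuid` and the two datasets are hypothetical, and only the column names (`uuid`, `s1`, `b1`) come from the comment above.

```cpp
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>
#include <arrow/dataset/plan.h>

namespace ac = arrow::acero;
namespace cp = arrow::compute;
namespace ds = arrow::dataset;

// Hypothetical helper: join two datasets on their shared "uuid" column.
arrow::Result<std::shared_ptr<arrow::Table>> JoinOnUuid(
    std::shared_ptr<ds::Dataset> left, std::shared_ptr<ds::Dataset> right) {
  // Register the dataset-specific exec nodes ("scan") with Acero.
  ds::internal::Initialize();

  // Project each scan down to only the columns the join needs, so the
  // thousand-column files are never fully materialized.
  auto left_opts = std::make_shared<ds::ScanOptions>();
  left_opts->projection =
      cp::project({cp::field_ref("uuid"), cp::field_ref("s1")}, {"uuid", "s1"});
  auto right_opts = std::make_shared<ds::ScanOptions>();
  right_opts->projection =
      cp::project({cp::field_ref("uuid"), cp::field_ref("b1")}, {"uuid", "b1"});

  ac::Declaration left_scan{"scan", ds::ScanNodeOptions{left, left_opts}};
  ac::Declaration right_scan{"scan", ds::ScanNodeOptions{right, right_opts}};

  // Inner hash join on the shared uuid key; a "project" node could follow
  // to drop uuid from the result, as the issue description asks.
  ac::HashJoinNodeOptions join_opts{ac::JoinType::INNER,
                                    /*left_keys=*/{"uuid"},
                                    /*right_keys=*/{"uuid"}};
  ac::Declaration join{"hashjoin",
                       {std::move(left_scan), std::move(right_scan)},
                       std::move(join_opts)};

  // Run the plan to completion and collect the output into one table.
  return ac::DeclarationToTable(std::move(join));
}
```

Projecting inside the scan node (rather than after it) is the main lever here: with ~1000 columns per file, pushing the projection into the scan keeps the join's build and probe sides narrow.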

> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
>                 Key: ARROW-17740
>                 URL: https://issues.apache.org/jira/browse/ARROW-17740
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: LinGeLin
>            Priority: Major
>         Attachments: data.zip, join_test.zip, test.cpp, test_join.cpp, 
> v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>  
> In my scenario, I read one Parquet file and then several other Parquet 
> files. These files share a column name (uuid). I need to join (on uuid), 
> project (to remove uuid), and filter (with some custom filtering) the 
> results of the two reads. I found that Acero was the only way to do the 
> join, but when I tested it, performance was very poor and very unstable, 
> and core dumps happened often. Is there another way? Or just another way 
> to do a join?
>  
> my project commit: 
> [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow ==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/... 
> //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2 
> --copt=-mfma --copt=-mavx2 
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl 
> --force-reinstall --no-deps
>  
> run v4test.py to test the dataset
>  
> Data.zip contains several Parquet files, which are stored on S3 in my 
> scenario.
> I have copied some of the code into test.cpp; it only shows the general 
> flow and does not compile.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
