[
https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611404#comment-17611404
]
LinGeLin commented on ARROW-17740:
----------------------------------
I'm giving up on Acero. I have also tested DuckDB, and it also performs
poorly when reading many columns. Any other suggestions? Otherwise, I'll
just have to put up with it.
> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
> Key: ARROW-17740
> URL: https://issues.apache.org/jira/browse/ARROW-17740
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: LinGeLin
> Priority: Major
> Attachments: data.zip, image-2022-09-30-14-32-48-405.png,
> join_test.zip, test.cpp, test_join.cpp, test_join1.cpp, v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>
> In my scenario, I read one Parquet file and then several other Parquet
> files. All of these files share a column named UUID. I need to join the
> results of the two reads on UUID, project (remove UUID), and filter (some
> custom filtering). I found that only Acero could be used to do the join,
> but when I tested it, its performance was very poor and it was very
> unstable, with frequent core dumps. Is there another way to do this, or at
> least another way to do a join?
>
> My project commit:
> [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow==2.6.2
> You can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/...
> //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2
> --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl
> --force-reinstall --no-deps
>
> Run v4test.py to test the dataset.
>
> data.zip contains several Parquet files, which are stored on S3 in my
> scenario.
> I have copied some of the code into test.cpp. It only shows the general
> flow and has not been compiled.
>
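To pin down the semantics of the join, project, and filter flow described in the quoted issue, here is a minimal sketch in plain Python. It is not taken from the attached test.cpp or v4test.py and it is not an Arrow or Acero API. The function name join_project_filter, the columns x and y, and the sample rows are all hypothetical; only the UUID key comes from the issue. It assumes an inner join is what is wanted.

```python
from collections import defaultdict


def join_project_filter(left_rows, right_rows, key, keep):
    """Inner hash join on `key`, drop `key`, keep rows where keep(row) is true.

    Hypothetical helper for illustration only; rows are plain dicts.
    """
    # Build phase: index the left rows by their join key.
    index = defaultdict(list)
    for row in left_rows:
        index[row[key]].append(row)

    out = []
    # Probe phase: look up each right row's key in the index.
    for right in right_rows:
        for left in index.get(right[key], []):
            merged = {**left, **right}
            merged.pop(key)        # project: remove the join key
            if keep(merged):       # filter: caller-supplied predicate
                out.append(merged)
    return out


# Invented sample data standing in for the two Parquet reads.
left = [{"UUID": "a", "x": 1}, {"UUID": "b", "x": 2}, {"UUID": "c", "x": 3}]
right = [{"UUID": "b", "y": 20.0}, {"UUID": "c", "y": 30.0}, {"UUID": "d", "y": 40.0}]

result = join_project_filter(left, right, "UUID", lambda row: row["y"] >= 25.0)
print(result)  # [{'x': 3, 'y': 30.0}]
```

A row-at-a-time version like this is only useful as a reference for checking the output of a real engine on a small sample; it says nothing about performance on many columns.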
--
This message was sent by Atlassian Jira
(v8.20.10#820010)