[
https://issues.apache.org/jira/browse/ARROW-17740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607787#comment-17607787
]
LinGeLin commented on ARROW-17740:
----------------------------------
Unfortunately, once the Parquet files reach about 500 MB, performance still
does not meet our requirements. Using gen_csv.py I generated a CSV file with
1,000 columns and 100,000 rows and converted it to a Parquet file to serve as
the left file. A Parquet file with 1,000 columns and 80,000 rows was generated
as the right file. Only one column (S1, B1) is read from each file. The test
results are as follows:
duration: 7.41882 s
rows: 80000
speed: 10783.4 rows/s
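The reported speed is just the row count divided by the duration; a quick check with the figures above:

```python
# Throughput check using the benchmark figures reported above.
rows = 80_000
duration_s = 7.41882

speed = rows / duration_s  # rows per second
print(f"{speed:.1f} rows/s")  # 10783.4 rows/s
```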
In fact, in our scenario we use Parquet to store the features of a
recommendation system. The right file stores the features of all the
recommendation algorithm groups and will have about 7,000 columns. The left
file stores preprocessed features for a specific group; it has about 350
columns so far, of which roughly 30 are arrays of length 30.
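For this access pattern (join two wide tables on a shared UUID column, then drop the key), one Acero-free fallback is a manual hash join: build a dictionary on the smaller side, probe with the larger. The sketch below uses plain Python dicts as stand-in rows, not the project's actual code; in practice the rows would come from the two Parquet reads.

```python
# Hedged sketch: a manual hash join on a shared "uuid" key, as an
# alternative to Acero's hash-join node. Rows are plain dicts here.

def hash_join(left_rows, right_rows, key="uuid"):
    # Build phase: index the smaller (right) side by the join key.
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    # Probe phase: for each left row, emit merged rows and drop the key
    # (the "project: remove UUID" step from the issue description).
    for left in left_rows:
        for right in index.get(left[key], []):
            merged = {**left, **right}
            merged.pop(key, None)
            yield merged

left = [{"uuid": 1, "s1": "a"}, {"uuid": 2, "s1": "b"}]
right = [{"uuid": 2, "b1": 10}, {"uuid": 3, "b1": 20}]

print(list(hash_join(left, right)))
# [{'s1': 'b', 'b1': 10}]
```

Custom filtering can be applied inside the probe loop before merging, which avoids materializing rows that would be discarded anyway.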
> [c++][compute]Is there any other way to use Join besides Acero?
> ---------------------------------------------------------------
>
> Key: ARROW-17740
> URL: https://issues.apache.org/jira/browse/ARROW-17740
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: LinGeLin
> Priority: Major
> Attachments: data.zip, join_test.zip, test.cpp, test_join.cpp,
> v4test.py
>
>
> Acero performs poorly, and core dumps occur frequently!
>
> In the scenario I'm working on, I read one Parquet file and then several
> other Parquet files. These files share a column with the same name (UUID). I
> need to join (by UUID), project (remove UUID), and filter (some custom
> filtering) the results of the two reads. I found that Acero was the only way
> to do the join, but when I tested it, Acero's performance was very poor and
> very unstable, and core dumps happened often. Is there another way? Or just
> another way to do a join?
>
> my project commit:
> [link|https://github.com/LinGeLin/io/commit/9b1b06d8d74154f0768bf5258cc3eaa2b9e20701]
> tensorflow ==2.6.2
> you can build tfio as follows:
> ./configure.sh
> bazel build -s --verbose_failures $BAZEL_OPTIMIZATION //tensorflow_io/...
> //tensorflow_io_gcs_filesystem/... --compilation_mode=opt --copt=-msse4.2
> --copt=-mfma --copt=-mavx2
> python setup.py bdist_wheel --data bazel-bin
> pip install dist/tensorflow_io-0.21.0-cp38-cp38-linux_x86_64.whl
> --force-reinstall --no-deps
>
> Run v4test.py to test the dataset.
>
> data.zip contains several Parquet files, which are stored on S3 in my
> scenario.
> I have copied some of the code into test.cpp; it only shows the general
> flow and does not compile.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)