zhixingheyi-tian commented on PR #14353:
URL: https://github.com/apache/arrow/pull/14353#issuecomment-1346200212
> Hi @zhixingheyi-tian, sorry this hasn't gotten reviewer attention for a while. Are you still interested in working on this?
>
> Would you mind rebasing and cleaning up any commented code?
>
> In addition, I noticed there aren't any benchmarks for binary or string types in either of these two benchmark files:
>
> * `src/parquet/arrow/reader_writer_benchmark.cc`
> * `src/parquet/parquet_encoding_benchmark.cc`
>
> We'll want some benchmarks in order to evaluate the changes. If you add those, then you can compare benchmarks locally like so:
>
> ```shell
> BUILD_DIR=cpp/path/to/build
>
> archery benchmark run $BUILD_DIR \
>     --suite-filter=parquet-arrow-reader-writer-benchmark \
>     --output=contender.json
>
> git checkout master # Or some ref that has benchmarks but not your changes
>
> archery benchmark run $BUILD_DIR \
>     --suite-filter=parquet-arrow-reader-writer-benchmark \
>     --output=baseline.json
>
> archery benchmark diff contender.json baseline.json
> ```

Hi @wjones127,

Good idea to demonstrate the performance. Regarding how to add Binary/String benchmarks to the `*_benchmark.cc` files: I noticed that Arrow's benchmark testing does not currently support generating random binary data (see the sketch at the end of this comment for one possible approach).

My local end-to-end performance testing on a server was set up as follows:

```
Generate one Parquet file with dictionary encoding (100M rows, 10 cols)
Run the Parquet scan with one thread on CentOS 7.6
```
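On the random-string-data question: below is a minimal sketch of what a string-column read benchmark might look like, assuming the `arrow::random::RandomArrayGenerator` utility from `arrow/testing/random.h` (its `StringWithRepeats` method produces low-cardinality random strings, which is the case dictionary encoding targets) and the Google Benchmark setup already used by `reader_writer_benchmark.cc`. The sizes and names here (`kNumRows`, `kNumUnique`, `BM_ReadDictStringColumn`) are illustrative, not from the PR:

```cpp
#include <memory>

#include "benchmark/benchmark.h"

#include "arrow/api.h"
#include "arrow/io/memory.h"
#include "arrow/testing/random.h"
#include "arrow/util/logging.h"
#include "parquet/arrow/reader.h"
#include "parquet/arrow/writer.h"
#include "parquet/properties.h"

namespace {

constexpr int64_t kNumRows = 1 << 20;    // illustrative row count
constexpr int64_t kNumUnique = 1 << 10;  // low cardinality so dictionary encoding applies

// Write one dictionary-encoded string column of random data to an
// in-memory Parquet file and return the resulting buffer.
std::shared_ptr<arrow::Buffer> MakeDictStringFile() {
  arrow::random::RandomArrayGenerator rng(/*seed=*/42);
  auto values = rng.StringWithRepeats(kNumRows, kNumUnique,
                                      /*min_length=*/8, /*max_length=*/32);
  auto table = arrow::Table::Make(
      arrow::schema({arrow::field("str", arrow::utf8())}), {values});

  auto sink = arrow::io::BufferOutputStream::Create().ValueOrDie();
  auto props =
      parquet::WriterProperties::Builder().enable_dictionary()->build();
  ARROW_CHECK_OK(parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                            sink, /*chunk_size=*/kNumRows, props));
  return sink->Finish().ValueOrDie();
}

// Measure decoding the dictionary-encoded string column back into Arrow.
void BM_ReadDictStringColumn(benchmark::State& state) {
  auto buffer = MakeDictStringFile();
  for (auto _ : state) {
    auto source = std::make_shared<arrow::io::BufferReader>(buffer);
    std::unique_ptr<parquet::arrow::FileReader> reader;
    ARROW_CHECK_OK(parquet::arrow::OpenFile(source, arrow::default_memory_pool(),
                                            &reader));
    std::shared_ptr<arrow::Table> out;
    ARROW_CHECK_OK(reader->ReadTable(&out));
    benchmark::DoNotOptimize(out);
  }
  state.SetItemsProcessed(state.iterations() * kNumRows);
}

BENCHMARK(BM_ReadDictStringColumn);

}  // namespace
```

If a benchmark like this were registered in `reader_writer_benchmark.cc`, the `archery benchmark diff` workflow quoted above should pick it up under the same suite filter.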
