[GitHub] [arrow-rs] jiacai2050 opened a new issue, #2916: Perf about ParquetRecordBatchStream vs ParquetRecordBatchReader

GitBox Mon, 24 Oct 2022 01:12:41 -0700


jiacai2050 opened a new issue, #2916:
URL: https://github.com/apache/arrow-rs/issues/2916


   **Which part is this question about**
   API Usage & Perf
   
   **Describe your question**
   
   I created two benchmarks based on the [example code](https://docs.rs/parquet/latest/parquet/arrow/async_reader/index.html), 
and in my environment these are the results:
   - ParquetRecordBatchReader took 4s
   - ParquetRecordBatchStream took 5s
   
   The test data:
   - total rows: 40935755
   - row groups: 4998
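   
   As a rough sanity check on these figures, the sketch below derives the average row group size, the throughput of each reader, and the time gap per row group. It only does arithmetic on the numbers reported above; the function names are made up for illustration, and attributing the whole 1s gap to per-row-group overhead is an assumption, not something measured.
   
   ```rust
   // Figures reported in this issue.
   const TOTAL_ROWS: u64 = 40_935_755;
   const ROW_GROUPS: u64 = 4_998;
   const READER_SECS: f64 = 4.0; // ParquetRecordBatchReader
   const STREAM_SECS: f64 = 5.0; // ParquetRecordBatchStream
   
   /// Average number of rows stored in one row group.
   fn avg_rows_per_group(total_rows: u64, row_groups: u64) -> f64 {
       total_rows as f64 / row_groups as f64
   }
   
   /// Rows decoded per second for a run that took `secs` seconds.
   fn rows_per_sec(total_rows: u64, secs: f64) -> f64 {
       total_rows as f64 / secs
   }
   
   /// Hypothetical: microseconds of extra time per row group, ASSUMING the
   /// whole difference between the two runs is per-row-group overhead.
   fn gap_per_group_micros(slow_secs: f64, fast_secs: f64, row_groups: u64) -> f64 {
       (slow_secs - fast_secs) / row_groups as f64 * 1e6
   }
   ```
   
   Calling these with the constants above gives about 8190 rows per row group, roughly 10.2M rows/s for the reader versus 8.2M rows/s for the stream, and about 200 microseconds of extra time per row group under that assumption.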
   
   This is the schema of the Parquet file:
   ```
   message arrow_schema {
     required int64 tsid (INTEGER(64,false));
     required int64 enddate (TIMESTAMP(MILLIS,false));
     optional int64 id;
     optional int64 code;
     optional binary source (STRING);
     optional int64 innercode;
     optional int64 del;
     optional int64 jsid;
     optional int64 updatetime (TIMESTAMP(MILLIS,false));
     optional double weight;
   }
   
   ```
   
   **Additional context**
   I dug into the Parquet source code and found that both readers call 
`build_array_reader` to read the Parquet file, so the difference may lie above 
this layer.
   


