[https://issues.apache.org/jira/browse/ARROW-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17206895#comment-17206895]
Sietse Brouwer commented on ARROW-6774:
---------------------------------------
[~alippai], I can't get parquet::arrow::ParquetFileArrowReader to be faster
than parquet::file::reader::SerializedFileReader under commit `3fae71b10c42`.
Timings below, code below that, conclusions at the bottom. Interesting times are
in *bold*.
||n_rows||UTF-8 column included||reader||iteration unit _(loop does not iterate over rows within batches)_||time taken||
|50_000|yes|ParquetFileArrowReader|1 batch of 50k rows|14.9s|
|50_000|yes|ParquetFileArrowReader|10 batches of 5k rows|14.8s|
|50_000|yes|ParquetFileArrowReader|50k batches of 1 row|24.0s|
|50_000|yes|SerializedFileReader|get_row_iter|*14.5s*|
| | | | | |
|50_000|no|ParquetFileArrowReader|1 batch of 50k rows|*143ms*|
|50_000|no|ParquetFileArrowReader|10 batches of 5k rows|154ms|
|50_000|no|ParquetFileArrowReader|50k batches of 1 row|6.5s|
|50_000|no|SerializedFileReader|get_row_iter|*211ms*|
Here is the code I used to load the dataset with ParquetFileArrowReader (see
also this version of [^main.rs]):
{code:none}
use std::fs::File;
use std::rc::Rc;
use std::time::Instant;

// ArrowReader is the trait that provides get_schema / get_record_reader.
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn read_with_arrow(file: File) {
    // Wrap the low-level Parquet reader in the Arrow record-batch reader.
    let file_reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));
    println!("Arrow schema is: {}", arrow_reader.get_schema().unwrap());
    let mut record_batch_reader = arrow_reader
        .get_record_reader(/* batch size */ 50_000)
        .unwrap();

    // Time how long it takes to pull every batch; the batches themselves are discarded.
    let start = Instant::now();
    while let Some(_record) = record_batch_reader.next_batch().unwrap() {
        // no-op
    }
    let duration = start.elapsed();
    println!("{:?}", duration);
}
{code}
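For the SerializedFileReader rows in the table, I timed the get_row_iter loop from the issue description, wrapped in a function roughly like this (a sketch; see [^main.rs] for the exact code):
{code:none}
use std::fs::File;
use std::time::Instant;

// FileReader is the trait that provides get_row_iter.
use parquet::file::reader::{FileReader, SerializedFileReader};

fn read_with_row_iter(file: File) {
    // Read every row through the record-level API, as in the issue description.
    let reader = SerializedFileReader::new(file).unwrap();
    let mut iter = reader.get_row_iter(None).unwrap();

    let start = Instant::now();
    while let Some(_record) = iter.next() {
        // no-op
    }
    let duration = start.elapsed();
    println!("{:?}", duration);
}
{code}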
Main observations:
* We can't tell whether the slow loading when we include the UTF-8 column is because UTF-8 is slow to process, or because the column is simply very big (100 random Russian words per cell). Reading with a column projection might help separate the two; see the sketch after this list.
* When the big UTF-8 column is included, iterating over every row with SerializedFileReader is as fast as iterating over a few batches with ParquetFileArrowReader, even though the batch loop never touches the individual rows!
* Should I try this again with a dataset of 10k rows × 3k Float64 columns plus one small UTF-8 column?
* I'm not even sure which result I'm trying to reproduce or falsify here: whether adding a small UTF-8 column causes a disproportionate slowdown, or whether switching from SerializedFileReader to ParquetFileArrowReader does. Right now everything and nothing feels in scope of this issue; I wouldn't mind if somebody made it narrower and clearer.
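One way to separate the two explanations in the first bullet would be to read with a column projection, so the UTF-8 column is skipped at read time instead of removed from the file. A rough sketch, assuming the ArrowReader trait's get_record_reader_by_columns is available at this commit (I haven't double-checked the exact signature) and assuming, purely for illustration, that the UTF-8 column is the last of four columns:
{code:none}
use std::fs::File;
use std::rc::Rc;
use std::time::Instant;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn read_without_utf8_column(file: File) {
    let file_reader = SerializedFileReader::new(file).unwrap();
    let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(file_reader));

    // Request only the numeric columns; leaving out index 3 (the UTF-8 column)
    // is a made-up example and depends on the actual schema of the test file.
    let mut record_batch_reader = arrow_reader
        .get_record_reader_by_columns(vec![0, 1, 2], /* batch size */ 50_000)
        .unwrap();

    let start = Instant::now();
    while let Some(_batch) = record_batch_reader.next_batch().unwrap() {
        // no-op
    }
    println!("{:?}", start.elapsed());
}
{code}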
> [Rust] Reading parquet file is slow
> -----------------------------------
>
> Key: ARROW-6774
> URL: https://issues.apache.org/jira/browse/ARROW-6774
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Affects Versions: 0.15.0
> Reporter: Adam Lippai
> Priority: Major
> Attachments: data.py, main.rs, main.rs
>
>
> Using the example at
> [https://github.com/apache/arrow/tree/master/rust/parquet] is slow.
> The snippet
> {code:none}
> let reader = SerializedFileReader::new(file).unwrap();
> let mut iter = reader.get_row_iter(None).unwrap();
> let start = Instant::now();
> while let Some(record) = iter.next() {}
> let duration = start.elapsed();
> println!("{:?}", duration);
> {code}
> Runs for 17sec for a ~160MB parquet file.
> If there is a more effective way to load a parquet file, it would be nice to
> add it to the readme.
> P.S.: My goal is to construct an ndarray from it, I'd be happy for any tips.