[jira] [Commented] (ARROW-6774) [Rust] Reading parquet file is slow

Sietse Brouwer (Jira) Wed, 30 Sep 2020 15:16:20 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205061#comment-17205061
 ]


Sietse Brouwer commented on ARROW-6774:
---------------------------------------

I'm not sure what test data [~alippai] used, so I generated a test data set 
with 500k rows and two columns:
 * a column x containing random floating-point numbers,
 * and a column y where each cell contains a Unicode string of 100 
space-separated, mostly Cyrillic words.

See the attached [^data.py]. When I saved that 500k-row table as parquet with 
gzip compression, the resulting file was 174 MB.

I ran Adam's test snippet (the code I used is attached as [^main.rs]), 
compiling it against three different versions of parquet:
 * parquet=0.15.1
 * parquet=1.0.1
 * parquet=2.0.0-SNAPSHOT (specifically git:3fae71b10c42 of 2020-09-30).

*In all three cases running the snippet took almost exactly 150 seconds,* give 
or take one second.
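
For anyone repeating this comparison, the timing step can be kept separate 
from the parquet version under test. The following is only a minimal sketch 
using the standard library: time_iteration is a hypothetical helper name (it 
is not part of parquet), and the integer range is a stand-in for the iterator 
returned by reader.get_row_iter(None).unwrap() in the snippet from the issue 
description. Counting the rows also confirms that the whole file was read.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper (not part of the parquet crate): drains any iterator
/// and returns how many items it yielded and how long the drain took.
fn time_iteration<I: Iterator>(iter: I) -> (usize, Duration) {
    let start = Instant::now();
    let mut rows = 0usize;
    for _item in iter {
        rows += 1;
    }
    (rows, start.elapsed())
}

fn main() {
    // Assumption: this range stands in for the parquet row iterator;
    // 500_000 matches the row count of the test table described above.
    let (rows, elapsed) = time_iteration(0..500_000u32);
    println!("read {} rows in {:?}", rows, elapsed);
}
```

With the real reader, the argument would be the row iterator itself, so the 
same helper could be reused unchanged across the three parquet versions.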

Does that help you decide whether to close this issue, [~nevi_me]? Or perhaps 
your comment from 2019-10-07, Adam, used some other version to get that speed 
improvement? Should I change the test to use the ParquetFileArrowReader example 
in 
[https://github.com/apache/arrow/blob/3fae71b10c42/rust/parquet/src/arrow/mod.rs#L25-L50], 
so that this issue can be closed if that one is faster?

> [Rust] Reading parquet file is slow
> -----------------------------------
>
>                 Key: ARROW-6774
>                 URL: https://issues.apache.org/jira/browse/ARROW-6774
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>    Affects Versions: 0.15.0
>            Reporter: Adam Lippai
>            Priority: Major
>         Attachments: data.py, main.rs
>
>
> Using the example at 
> [https://github.com/apache/arrow/tree/master/rust/parquet] is slow.
> The snippet 
> {code:none}
> let reader = SerializedFileReader::new(file).unwrap();
> let mut iter = reader.get_row_iter(None).unwrap();
> let start = Instant::now();
> while let Some(record) = iter.next() {}
> let duration = start.elapsed();
> println!("{:?}", duration);
> {code}
> Runs for 17sec for a ~160MB parquet file.
> If there is a more effective way to load a parquet file, it would be nice to 
> add it to the readme.
> P.S.: My goal is to construct an ndarray from it, I'd be happy for any tips.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
