[ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906825#comment-16906825
 ] 

Wes McKinney commented on ARROW-6230:
-------------------------------------

For the record, loading this file and converting it to pandas takes on the 
same order of magnitude as fst, without any tuning of data types (e.g. 
converting things to factor/categorical):

{code}
In [28]: %time table = pq.read_table('2016Q4.parquet')                          
                                                                                
CPU times: user 6.57 s, sys: 4.2 s, total: 10.8 s
Wall time: 2.05 s

In [29]: %time df = table.to_pandas()                                           
                                                                                
CPU times: user 2.37 s, sys: 2.11 s, total: 4.48 s
Wall time: 2.04 s
{code}

So the performance issue is probably R-specific. I'll build the R package 
tomorrow and see if I can diagnose the problem.
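
For anyone who wants to reproduce the two timings outside IPython, here is a minimal sketch. It assumes pyarrow and pandas are installed; `measure` and `benchmark_parquet` are hypothetical helper names, not part of any library. It reports wall time and process CPU time separately, because in the numbers above the CPU total (10.8 s) is well above the wall time (2.05 s), which means the work was spread over several cores.

```python
import time


def measure(fn, *args, **kwargs):
    """Call fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + sys CPU time of the whole process
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


def benchmark_parquet(path):
    """Time the two steps separately: Parquet -> Arrow table, then Arrow -> pandas."""
    # Imported lazily so that measure() can be used without pyarrow installed.
    import pyarrow.parquet as pq

    table, read_wall, read_cpu = measure(pq.read_table, path)
    df, conv_wall, conv_cpu = measure(table.to_pandas)
    print(f"read_table: wall {read_wall:.2f} s, cpu {read_cpu:.2f} s")
    print(f"to_pandas:  wall {conv_wall:.2f} s, cpu {conv_cpu:.2f} s")
    return df
```

Calling `benchmark_parquet('2016Q4.parquet')` would print one line per step. The absolute numbers will depend on the machine and the pyarrow version, so they are only comparable against an fst run on the same hardware.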

> [R] Reading in parquent files are 20x slower than reading fst files in R
> ------------------------------------------------------------------------
>
>                 Key: ARROW-6230
>                 URL: https://issues.apache.org/jira/browse/ARROW-6230
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>         Environment: Windows 10 Pro and Ubuntu 
>            Reporter: Zhuo Jia Dai
>            Priority: Major
>             Fix For: 0.14.1
>
>         Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the datasets mentioned below from Parquet is about 20x slower 
> than loading them from the fst format in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
> library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> # read-in test: fst
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read-in test: parquet
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



