[ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906812#comment-16906812
 ] 

Wes McKinney commented on ARROW-6230:
-------------------------------------

Thanks for the example. I'm interested to see where the time is being spent. 
Reading Parquet files is quite fast in Python, so I'll see what the performance 
is there as well. 
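
A minimal sketch of how the Python side of that comparison might be timed, 
assuming pyarrow is installed and the data/a.parquet file written by the code 
in the report is available. The time_read helper is hypothetical (it is not 
part of pyarrow), and the conversion to pandas is included only so that the 
measurement roughly matches materializing a data frame in R.

```python
import os
import time


def time_read(reader, path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of reader(path)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    try:
        import pyarrow.parquet as pq
    except ImportError:
        pq = None

    # Assumption: same file name as in the report; adjust as needed.
    path = "data/a.parquet"
    if pq is not None and os.path.exists(path):
        # Read the Parquet file into an Arrow table, then convert to pandas.
        seconds = time_read(lambda p: pq.read_table(p).to_pandas(), path)
        print(f"pyarrow read_table + to_pandas: {seconds:.2f} s")
    else:
        print("pyarrow or the Parquet file is not available; nothing timed")
```

Taking the best of a few runs reduces the effect of disk caching and other 
noise, but the first (cold) read may still be the more relevant number.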

There's some work going on for the current release (see ARROW-3772, ARROW-3325, 
ARROW-3246) that will enable R factors to be written to and read from Parquet 
directly, so that could be a (no pun intended) factor in the results.

> [R] Reading in Parquet files is 20x slower than reading fst files in R
> ----------------------------------------------------------------------
>
>                 Key: ARROW-6230
>                 URL: https://issues.apache.org/jira/browse/ARROW-6230
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>         Environment: Windows 10 Pro and Ubuntu 
>            Reporter: Zhuo Jia Dai
>            Priority: Major
>             Fix For: 0.14.1
>
>         Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data sets mentioned below from Parquet is about 20x slower 
> than loading the same data from the fst format in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> # Path to the downloaded performance data file
> path = "data/Performance_2016Q4.txt"
> library(data.table)
> library(arrow)
> # Read the raw text file, then write it out in both formats
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> # Time reading the fst file
> system.time(a <- fst::read_fst("data/a.fst"))
> # 4.61 seconds
> rm(a); gc()
> # Time reading the Parquet file
> system.time(a <- arrow::read_parquet("data/a.parquet"))
> # 99.19 seconds



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
