[jira] [Comment Edited] (ARROW-6230) [R] Reading in parquent files are 20x slower than reading fst files in R

Wes McKinney (JIRA) Wed, 14 Aug 2019 08:37:29 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907359#comment-16907359
 ]


Wes McKinney edited comment on ARROW-6230 at 8/14/19 3:36 PM:
--------------------------------------------------------------

On the master branch I have

{code}
> a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
   user  system elapsed 
  8.174   2.866   2.969 
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
   user  system elapsed 
  9.330   3.681   3.353 
{code}

This is on a true 16-core system.

This suggests that your performance problem is being caused by memory thrashing 
related to ARROW-6060 -- sorry about that, I would guess we'll have the 0.15.0 
release out with that fixed within 6 weeks. 

perf report suggests there is certainly some optimization opportunity. 

https://gist.github.com/wesm/7b577f0ce7dfdf96fddfd91943c162e5


was (Author: wesmckinn):
On the master branch I have

{code}
> a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
   user  system elapsed 
  8.174   2.866   2.969 
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
   user  system elapsed 
  9.330   3.681   3.353 
{code}

This is on a true 16-core system.

This suggests that your performance problem is being caused by memory thrashing 
related to ARROW-6060 -- sorry about that, I would guess we'll have the 0.15.0 
release out with that fixed within 6 weeks. 

perf report suggests there is certainly some optimization opportunity. 

{code}
+   61.61%     0.00%  R        libc-2.27.so           [.] __clone
+   61.61%     0.00%  R        libpthread-2.27.so     [.] start_thread
+   61.61%     0.00%  R        libstdc++.so.6.0.26    [.] execute_native_thread_routine
+   61.61%     0.00%  R        libarrow.so.100.0.0    [.] std::thread::_State_impl<std::thread::_Invoker<std
+   56.48%     0.00%  R        libparquet.so.100.0.0  [.] std::__future_base::_Task_state<std::_Bind<parquet
+   56.48%     0.00%  R        libpthread-2.27.so     [.] __pthread_once_slow
+   56.48%     0.00%  R        libparquet.so.100.0.0  [.] std::__future_base::_State_baseV2::_M_do_set
+   56.48%     0.00%  R        libparquet.so.100.0.0  [.] std::_Function_handler<std::unique_ptr<std::__futu
+   56.48%     0.00%  R        libparquet.so.100.0.0  [.] parquet::arrow::FileReaderImpl::ReadSchemaField
+   56.47%     0.00%  R        libparquet.so.100.0.0  [.] parquet::arrow::LeafReader::NextBatch
+   38.83%     0.00%  R        libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   37.68%     0.00%  R        libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   34.85%     4.77%  R        libparquet.so.100.0.0  [.] parquet::DictByteArrayDecoderImpl::DecodeArrow<arr
+   34.85%     0.00%  R        libparquet.so.100.0.0  [.] parquet::internal::ByteArrayChunkedRecordReader::R
+   34.85%     0.00%  R        libparquet.so.100.0.0  [.] parquet::DictByteArrayDecoderImpl::DecodeArrow
+   34.65%     0.00%  R        [unknown]              [.] 0xffffffffffffffff
+   34.37%     0.03%  R        libR.so                [.] Rf_eval
+   34.19%     0.00%  R        libR.so                [.] 0x00007fa3538d099e
+   34.11%     0.00%  R        libR.so                [.] Rf_applyClosure
+   33.69%     0.00%  R        libR.so                [.] 0x00007fa3538c56e1
+   32.69%     0.00%  R        libR.so                [.] 0x00007fa3538c4e03
+   32.69%     0.00%  R        arrow.so               [.] _arrow_Table__to_dataframe
+   32.69%     0.00%  R        arrow.so               [.] Table__to_dataframe
+   32.69%     0.00%  R        arrow.so               [.] arrow::r::to_dataframe_parallel
+   18.64%     0.00%  R        arrow.so               [.] arrow::r::Converter::IngestSerial
+   18.46%     3.35%  R        arrow.so               [.] arrow::r::Converter_String::Ingest_some_nulls
+   16.90%     1.86%  R        libparquet.so.100.0.0  [.] arrow::internal::ChunkedBinaryBuilder::Append
+   14.40%     4.11%  R        libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
+   14.16%     0.73%  R        libR.so                [.] Rf_allocVector3
+   12.40%     0.00%  R        libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   12.39%     6.74%  R        libarrow.so.100.0.0    [.] arrow::BufferBuilder::Append
+   11.19%     1.83%  R        libparquet.so.100.0.0  [.] arrow::internal::ChunkedBinaryBuilder::AppendNull
+   10.13%     0.01%  R        libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+    9.89%     6.42%  R        libc-2.27.so           [.] __memmove_avx_unaligned_erms
+    9.21%     8.19%  R        libR.so                [.] Rf_mkCharLenCE
+    9.14%     5.09%  R        libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
+    8.86%     8.86%  R        libparquet.so.100.0.0  [.] parquet::internal::DefinitionLevelsToBitmap
+    6.32%     6.32%  R        [unknown]              [k] 0xffffffffb5200a67
+    5.18%     4.61%  R        libparquet.so.100.0.0  [.] arrow::util::RleDecoder::GetBatchWithDictSpaced<do
+    5.12%     0.00%  R        libarrow.so.100.0.0    [.] arrow::internal::ThreadedTaskGroup::AppendReal(std
+    5.12%     0.00%  R        arrow.so               [.] std::_Function_handler<arrow::Status (), arrow::r:
+    5.08%     0.00%  R        arrow.so               [.] arrow::r::Converter_SimpleArray<14>::Allocate
+    5.02%     1.54%  R        libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
{code}

> [R] Reading in parquent files are 20x slower than reading fst files in R
> ------------------------------------------------------------------------
>
>                 Key: ARROW-6230
>                 URL: https://issues.apache.org/jira/browse/ARROW-6230
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>         Environment: Windows 10 Pro and Ubuntu 
>            Reporter: Zhuo Jia Dai
>            Priority: Major
>             Fix For: 0.14.1
>
>         Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Reading any of the data mentioned below from a Parquet file is about 20x 
> slower than reading it from the fst format in R.
>  
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I 
> think it's best you register.
>  
> !image-2019-08-14-10-04-56-834.png!
>  
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
> library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> # read-in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read-in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
