[
https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907359#comment-16907359
]
Wes McKinney edited comment on ARROW-6230 at 8/14/19 3:36 PM:
--------------------------------------------------------------
On the master branch I have
{code}
> a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
   user  system elapsed
  8.174   2.866   2.969
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
   user  system elapsed
  9.330   3.681   3.353
{code}
This is on a true 16-core system.
This suggests that your performance problem is being caused by memory thrashing
related to ARROW-6060 -- sorry about that, I would guess we'll have the 0.15.0
release out with that fixed within 6 weeks.
perf report suggests there is certainly some optimization opportunity.
https://gist.github.com/wesm/7b577f0ce7dfdf96fddfd91943c162e5
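For anyone without access to the Fannie Mae files, a comparison of this kind could be approximated on synthetic data. The following is only a rough sketch: the helper `time_read`, the column names, and the row count are made up for illustration and do not come from this thread, and a synthetic table with a few string columns containing missing values will not necessarily reproduce the timings above. It assumes the `fst` and `arrow` packages are installed and skips the comparison otherwise.

```r
# Hypothetical helper (not from this thread): median elapsed seconds
# over a few repeated reads of the same file.
time_read <- function(read_fun, path, reps = 3L) {
  elapsed <- vapply(seq_len(reps), function(i) {
    gc()  # collect garbage first so earlier runs add less noise
    system.time(read_fun(path))[["elapsed"]]
  }, numeric(1))
  median(elapsed)
}

# Synthetic stand-in for the loan data (assumed shape): an integer id
# plus two string columns with missing values.
set.seed(42)
n <- 1e6
df <- data.frame(
  id = sample.int(1e5, n, replace = TRUE),
  status = sample(c("C", "1", "2", NA), n, replace = TRUE),
  servicer = sample(c("BANK A", "BANK B", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Only run the comparison when both packages are available.
if (requireNamespace("fst", quietly = TRUE) &&
    requireNamespace("arrow", quietly = TRUE)) {
  fst_path <- tempfile(fileext = ".fst")
  parquet_path <- tempfile(fileext = ".parquet")
  fst::write_fst(df, fst_path)
  arrow::write_parquet(df, parquet_path)

  timings <- c(
    fst = time_read(fst::read_fst, fst_path),
    parquet = time_read(arrow::read_parquet, parquet_path)
  )
  print(timings)
}
```

Taking the median of several reads is a design choice to dampen file-cache and garbage-collection effects; a single `system.time` call, as in the transcript above, is also reasonable for files this large.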
was (Author: wesmckinn):
perf report suggests there is certainly some optimization opportunity.
{code}
+  61.61%   0.00%  R  libc-2.27.so           [.] __clone
+  61.61%   0.00%  R  libpthread-2.27.so     [.] start_thread
+  61.61%   0.00%  R  libstdc++.so.6.0.26    [.] execute_native_thread_routine
+  61.61%   0.00%  R  libarrow.so.100.0.0    [.] std::thread::_State_impl<std::thread::_Invoker<std
+  56.48%   0.00%  R  libparquet.so.100.0.0  [.] std::__future_base::_Task_state<std::_Bind<parquet
+  56.48%   0.00%  R  libpthread-2.27.so     [.] __pthread_once_slow
+  56.48%   0.00%  R  libparquet.so.100.0.0  [.] std::__future_base::_State_baseV2::_M_do_set
+  56.48%   0.00%  R  libparquet.so.100.0.0  [.] std::_Function_handler<std::unique_ptr<std::__futu
+  56.48%   0.00%  R  libparquet.so.100.0.0  [.] parquet::arrow::FileReaderImpl::ReadSchemaField
+  56.47%   0.00%  R  libparquet.so.100.0.0  [.] parquet::arrow::LeafReader::NextBatch
+  38.83%   0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+  37.68%   0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+  34.85%   4.77%  R  libparquet.so.100.0.0  [.] parquet::DictByteArrayDecoderImpl::DecodeArrow<arr
+  34.85%   0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::ByteArrayChunkedRecordReader::R
+  34.85%   0.00%  R  libparquet.so.100.0.0  [.] parquet::DictByteArrayDecoderImpl::DecodeArrow
+  34.65%   0.00%  R  [unknown]              [.] 0xffffffffffffffff
+  34.37%   0.03%  R  libR.so                [.] Rf_eval
+  34.19%   0.00%  R  libR.so                [.] 0x00007fa3538d099e
+  34.11%   0.00%  R  libR.so                [.] Rf_applyClosure
+  33.69%   0.00%  R  libR.so                [.] 0x00007fa3538c56e1
+  32.69%   0.00%  R  libR.so                [.] 0x00007fa3538c4e03
+  32.69%   0.00%  R  arrow.so               [.] _arrow_Table__to_dataframe
+  32.69%   0.00%  R  arrow.so               [.] Table__to_dataframe
+  32.69%   0.00%  R  arrow.so               [.] arrow::r::to_dataframe_parallel
+  18.64%   0.00%  R  arrow.so               [.] arrow::r::Converter::IngestSerial
+  18.46%   3.35%  R  arrow.so               [.] arrow::r::Converter_String::Ingest_some_nulls
+  16.90%   1.86%  R  libparquet.so.100.0.0  [.] arrow::internal::ChunkedBinaryBuilder::Append
+  14.40%   4.11%  R  libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
+  14.16%   0.73%  R  libR.so                [.] Rf_allocVector3
+  12.40%   0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+  12.39%   6.74%  R  libarrow.so.100.0.0    [.] arrow::BufferBuilder::Append
+  11.19%   1.83%  R  libparquet.so.100.0.0  [.] arrow::internal::ChunkedBinaryBuilder::AppendNull
+  10.13%   0.01%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   9.89%   6.42%  R  libc-2.27.so           [.] __memmove_avx_unaligned_erms
+   9.21%   8.19%  R  libR.so                [.] Rf_mkCharLenCE
+   9.14%   5.09%  R  libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
+   8.86%   8.86%  R  libparquet.so.100.0.0  [.] parquet::internal::DefinitionLevelsToBitmap
+   6.32%   6.32%  R  [unknown]              [k] 0xffffffffb5200a67
+   5.18%   4.61%  R  libparquet.so.100.0.0  [.] arrow::util::RleDecoder::GetBatchWithDictSpaced<do
+   5.12%   0.00%  R  libarrow.so.100.0.0    [.] arrow::internal::ThreadedTaskGroup::AppendReal(std
+   5.12%   0.00%  R  arrow.so               [.] std::_Function_handler<arrow::Status (), arrow::r:
+   5.08%   0.00%  R  arrow.so               [.] arrow::r::Converter_SimpleArray<14>::Allocate
+   5.02%   1.54%  R  libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
{code}
> [R] Reading in parquet files is 20x slower than reading fst files in R
> ----------------------------------------------------------------------
>
> Key: ARROW-6230
> URL: https://issues.apache.org/jira/browse/ARROW-6230
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Environment: Windows 10 Pro and Ubuntu
> Reporter: Zhuo Jia Dai
> Priority: Major
> Fix For: 0.14.1
>
> Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format
> in R.
>
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I
> think it's best you register.
>
> !image-2019-08-14-10-04-56-834.png!
>
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
> library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> # read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> # read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```