[ https://issues.apache.org/jira/browse/ARROW-6230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907359#comment-16907359 ]
Wes McKinney commented on ARROW-6230:
-------------------------------------

On the master branch I have

{code}
> a <- data.table::fread("/home/wesm/data/fanniemae_loanperf/Performance_2016Q4.txt", header=FALSE)
|--------------------------------------------------|
|==================================================|
> fst::write_fst(a, "/home/wesm/data/fanniemae_loanperf/2016Q4.fst")
> system.time(a <- fst::read_fst("/home/wesm/data/fanniemae_loanperf/2016Q4.fst"))
   user  system elapsed
  8.174   2.866   2.969
> system.time(a <- arrow::read_parquet("/home/wesm/data/fanniemae_loanperf/2016Q4.parquet"))
   user  system elapsed
  9.330   3.681   3.353
{code}

This is on a true 16-core system. This suggests that your performance problem is being caused by memory thrashing related to ARROW-6060. Sorry about that; I would guess we'll have the 0.15.0 release out with that fixed within 6 weeks.

perf report suggests there is certainly some optimization opportunity.

{code}
+   61.61%     0.00%  R  libc-2.27.so           [.] __clone
+   61.61%     0.00%  R  libpthread-2.27.so     [.] start_thread
+   61.61%     0.00%  R  libstdc++.so.6.0.26    [.] execute_native_thread_routine
+   61.61%     0.00%  R  libarrow.so.100.0.0    [.] std::thread::_State_impl<std::thread::_Invoker<std
+   56.48%     0.00%  R  libparquet.so.100.0.0  [.] std::__future_base::_Task_state<std::_Bind<parquet
+   56.48%     0.00%  R  libpthread-2.27.so     [.] __pthread_once_slow
+   56.48%     0.00%  R  libparquet.so.100.0.0  [.] std::__future_base::_State_baseV2::_M_do_set
+   56.48%     0.00%  R  libparquet.so.100.0.0  [.] std::_Function_handler<std::unique_ptr<std::__futu
+   56.48%     0.00%  R  libparquet.so.100.0.0  [.] parquet::arrow::FileReaderImpl::ReadSchemaField
+   56.47%     0.00%  R  libparquet.so.100.0.0  [.] parquet::arrow::LeafReader::NextBatch
+   38.83%     0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   37.68%     0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   34.85%     4.77%  R  libparquet.so.100.0.0  [.] parquet::DictByteArrayDecoderImpl::DecodeArrow<arr
+   34.85%     0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::ByteArrayChunkedRecordReader::R
+   34.85%     0.00%  R  libparquet.so.100.0.0  [.] parquet::DictByteArrayDecoderImpl::DecodeArrow
+   34.65%     0.00%  R  [unknown]              [.] 0xffffffffffffffff
+   34.37%     0.03%  R  libR.so                [.] Rf_eval
+   34.19%     0.00%  R  libR.so                [.] 0x00007fa3538d099e
+   34.11%     0.00%  R  libR.so                [.] Rf_applyClosure
+   33.69%     0.00%  R  libR.so                [.] 0x00007fa3538c56e1
+   32.69%     0.00%  R  libR.so                [.] 0x00007fa3538c4e03
+   32.69%     0.00%  R  arrow.so               [.] _arrow_Table__to_dataframe
+   32.69%     0.00%  R  arrow.so               [.] Table__to_dataframe
+   32.69%     0.00%  R  arrow.so               [.] arrow::r::to_dataframe_parallel
+   18.64%     0.00%  R  arrow.so               [.] arrow::r::Converter::IngestSerial
+   18.46%     3.35%  R  arrow.so               [.] arrow::r::Converter_String::Ingest_some_nulls
+   16.90%     1.86%  R  libparquet.so.100.0.0  [.] arrow::internal::ChunkedBinaryBuilder::Append
+   14.40%     4.11%  R  libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
+   14.16%     0.73%  R  libR.so                [.] Rf_allocVector3
+   12.40%     0.00%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+   12.39%     6.74%  R  libarrow.so.100.0.0    [.] arrow::BufferBuilder::Append
+   11.19%     1.83%  R  libparquet.so.100.0.0  [.] arrow::internal::ChunkedBinaryBuilder::AppendNull
+   10.13%     0.01%  R  libparquet.so.100.0.0  [.] parquet::internal::TypedRecordReader<parquet::Phys
+    9.89%     6.42%  R  libc-2.27.so           [.] __memmove_avx_unaligned_erms
+    9.21%     8.19%  R  libR.so                [.] Rf_mkCharLenCE
+    9.14%     5.09%  R  libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
+    8.86%     8.86%  R  libparquet.so.100.0.0  [.] parquet::internal::DefinitionLevelsToBitmap
+    6.32%     6.32%  R  [unknown]              [k] 0xffffffffb5200a67
+    5.18%     4.61%  R  libparquet.so.100.0.0  [.] arrow::util::RleDecoder::GetBatchWithDictSpaced<do
+    5.12%     0.00%  R  libarrow.so.100.0.0    [.] arrow::internal::ThreadedTaskGroup::AppendReal(std
+    5.12%     0.00%  R  arrow.so               [.] std::_Function_handler<arrow::Status (), arrow::r:
+    5.08%     0.00%  R  arrow.so               [.] arrow::r::Converter_SimpleArray<14>::Allocate
+    5.02%     1.54%  R  libarrow.so.100.0.0    [.] arrow::BaseBinaryBuilder<arrow::BinaryType>::Appen
{code}

> [R] Reading in parquent files are 20x slower than reading fst files in R
> ------------------------------------------------------------------------
>
>                 Key: ARROW-6230
>                 URL: https://issues.apache.org/jira/browse/ARROW-6230
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>         Environment: Windows 10 Pro and Ubuntu
>            Reporter: Zhuo Jia Dai
>            Priority: Major
>             Fix For: 0.14.1
>
>         Attachments: image-2019-08-14-10-04-56-834.png
>
>
> *Problem*
> Loading any of the data I mentioned below is 20x slower than the fst format in R.
>
> *How to get the data*
> [https://loanperformancedata.fanniemae.com/lppub/index.html]
> Register and download any of these. I can't provide the data to you, and I think it's best you register.
>
> !image-2019-08-14-10-04-56-834.png!
>
> *Code*
> ```r
> path = "data/Performance_2016Q4.txt"
> library(data.table)
> library(arrow)
> a = data.table::fread(path, header = FALSE)
> fst::write_fst(a, "data/a.fst")
> arrow::write_parquet(a, "data/a.parquet")
> rm(a); gc()
> #read in test
> system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds
> rm(a); gc()
> #read in test
> system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
> ```

--
This message was sent by Atlassian JIRA (v7.6.14#76016)