Hey Igor,

I have been dealing with *CSV*/*XLSX* files from time to time, and depending on the size of the files you are mentioning, 180 seconds isn't really that much. From my experience, *vroom* is the fastest reader I've encountered, but it only handles *CSV* files (I can vouch for it on files up to 8 GB in size). If you have the option to import CSVs instead, you should give it a try.
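Something along these lines should work (a rough sketch, assuming the nine files can be exported as comma-delimited CSVs into the same dir.temp folder and keep the same four-row header offset and 30 text columns as your xlsx files):

# Hypothetical CSV version of the import step
library(vroom)

csv_files <- list.files(dir.temp, pattern = "\\.csv$", full.names = TRUE)

y <- vroom::vroom(csv_files,
                  delim = ",",
                  skip = 4,                   # same header offset assumed as in the xlsx files
                  col_types = strrep("c", 30))  # 30 text columns, as in your read_excel call

vroom() accepts a vector of paths and binds the rows itself, so there is no need to map over the files one by one.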
Other than that, there are several other factors to be considered, such as memory and disk read/write capabilities. And again, it is just 180 seconds, so just suggest the user go get a cup of coffee :)

Best,
Diego

On Wed, 5 Oct 2022 at 10:02, Igor L <igorlal...@gmail.com> wrote:

> According to my internet research, it looks like readxl is the fastest
> package.
>
> The profvis package indicated that the bottleneck is indeed in importing
> the files.
>
> My processor has six cores, but when I use four of them the computer
> crashes completely. With three it is still usable, so I ran one more
> benchmark comparing a for loop, map_dfr and future_map_dfr (with
> multisession and three workers).
>
> After running the benchmark 10 times, the result was:
>
>              expr      min       lq     mean   median       uq      max neval
>      import_for() 140.9940 147.9722 160.7229 155.6459 172.4661 199.1059    10
>  import_map_dfr() 161.6707 339.6769 480.5760 567.8389 643.8895 666.0726    10
>    import_furrr() 112.1374 116.4301 127.5976 129.0067 137.9179 140.8632    10
>
> To me this shows that the furrr package is the best solution in this
> case, but what would explain such a large difference from map_dfr?
>
> On Tue, 4 Oct 2022 at 16:58, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>
> > It looks like you are reading directly from URLs? How do you know the
> > delay is not network I/O delay?
> >
> > Parallel computation is not a panacea. It allows tasks _that are
> > CPU-bound_ to get through the CPU-intensive work faster. You need to be
> > certain that your tasks actually can benefit from parallelism before
> > using it... there is a significant overhead and added complexity to
> > using parallel processing that will lead to SLOWER processing if
> > mis-used.
> >
> > On October 4, 2022 11:29:54 AM PDT, Igor L <igorlal...@gmail.com> wrote:
> > > Hello all,
> > >
> > > I'm developing an R package that basically downloads, imports, cleans
> > > and merges nine files in xlsx format updated monthly from a public
> > > institution.
> > >
> > > The problem is that importing files in xlsx format is time consuming.
> > >
> > > My initial idea was to parallelize the execution of the read_xlsx
> > > function according to the number of cores in the user's processor,
> > > but apparently it didn't make much difference: the execution time
> > > only went from 185.89 to 184.12 seconds:
> > >
> > > # not parallelized code
> > > y <- purrr::map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                     readxl::read_excel, sheet = 1, skip = 4,
> > >                     col_types = c(rep('text', 30)))
> > >
> > > # parallelized code
> > > plan(strategy = future::multicore(workers = 4))
> > > y <- furrr::future_map_dfr(paste0(dir.temp, '/', lista.arquivos.locais),
> > >                            readxl::read_excel, sheet = 1, skip = 4,
> > >                            col_types = c(rep('text', 30)))
> > >
> > > Any suggestions to reduce the import processing time?
> > >
> > > Thanks in advance!
> >
> > --
> > Sent from my phone. Please excuse my brevity.

______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
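For reference, the definitions of the three benchmarked functions were not posted. A minimal sketch of what import_for(), import_map_dfr() and import_furrr() might look like, reusing dir.temp and lista.arquivos.locais from the original code and the three-worker multisession plan Igor describes (everything else is an assumption):

library(readxl)
library(purrr)
library(furrr)
library(future)
library(microbenchmark)

arquivos <- paste0(dir.temp, '/', lista.arquivos.locais)

import_for <- function() {
  # Sequential for loop: read each file into a list, bind once at the end
  lst <- vector("list", length(arquivos))
  for (i in seq_along(arquivos)) {
    lst[[i]] <- readxl::read_excel(arquivos[i], sheet = 1, skip = 4,
                                   col_types = rep('text', 30))
  }
  do.call(rbind, lst)
}

import_map_dfr <- function() {
  # Sequential purrr::map_dfr, as in the original non-parallel code
  purrr::map_dfr(arquivos, readxl::read_excel, sheet = 1, skip = 4,
                 col_types = rep('text', 30))
}

import_furrr <- function() {
  # Parallel version: three background R sessions via multisession
  future::plan(future::multisession, workers = 3)
  on.exit(future::plan(future::sequential), add = TRUE)
  furrr::future_map_dfr(arquivos, readxl::read_excel, sheet = 1, skip = 4,
                        col_types = rep('text', 30))
}

microbenchmark::microbenchmark(
  import_for(), import_map_dfr(), import_furrr(), times = 10
)

Note that the documented way to set the backend is plan(multisession, workers = 3) rather than calling the strategy function inside plan(). Also, multicore futures are not supported on Windows and are disabled by default inside RStudio, where they silently fall back to sequential processing; if that applies to the setup used in the first attempt, it could explain why plan(multicore) showed essentially no speed-up.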