I have done most of the analysis using tools in base R: read one file, do analysis, keep results, move to next file. Each cycle is several hours. Making small changes in analysis is costly.
I want to explore methods that will provide faster access to data, but do not require resaving data. Sent from my iPhone > On May 24, 2026, at 6:33 PM, [email protected] wrote: > > Although there is some code there, we may want to step back and ask what the > person wants to do. > > Assuming you have thousands of files in CSV format and you can easily access > them one by one, then there are two major methods that seem interesting if > you want to evaluate them in some way. > > You can open all the files and combine them into one entity you can query. > > Alternatively, you can create a query that you then do repeatedly on as many > (or all) of the files till you get a (combined) result. > > The query can use native R mechanisms, or a package that does SQL if you put > the data into a database, or another package such as in the tidyverse using > dplyr and so on. > > In real life, you have to ask if the many CSV files contain valid data > without duplications or missing data and so on. If it ios not guaranteed to > be exactly what you want, then consolidating the data and running multiple > functions on it to make it what you want may be important. Removing (or > maybe fixing) some rows may make sense. It may make sense to remove columns > not needed for the query or rows that fail some condition that simplifies > later queries. It may make sense to choose subsets, remove duplicates on the > columns remaining and much more. As one of many examples, you might > determine info about some column and then remove outliers above and below a > threshold. > > If you insist on using SQL, as the final query, you could push all the > remaining data into a database. > > If your data is too large when combined, there are packages that can help > you work with them albeit this is an example where pushing it all into a > database first makes sense, albeit it can be slow. > > And note, your solution can include work outside of R such as copying all > the files (assuming n o header in each) into one file and then continuing > from R. > > Good luck. > > > -----Original Message----- > From: R-help <[email protected]> On Behalf Of John Kane > Sent: Sunday, May 24, 2026 5:41 PM > To: Naresh Gurbuxani <[email protected]> > Cc: [email protected] > Subject: Re: [R] duckdb table from multiple csv files > > I am not really sure what you are doing here. > > Where are the files stored? Are they in one place? > What size are they? > >> On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani >> <[email protected]> >> wrote: >> >> >> I have approximately ten thousand csv files with identical columns and >> formats. I want to run some SQL queries on a virtual database, where all >> of these files are treated as one table. While it is possible to run >> SQL query, dbListTables() does not show this table. Is it possible to >> list all tables including those created from arrow FileSystem? >> >> Is it possible to achieve this result without arrow package? >> >> # Create example data >> library(data.table) >> data("flights", package = "nycflights13") >> fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv") >> fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv") >> fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv") >> >> data("airports", package = "nycflights13") >> fwrite(airports, "data/airports.csv") >> >> # Verify data saved as intended >> dir("data") >> [1] "airports.csv" "flights" >> dir("data/flights/") >> [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv" >> >> # Create virtual database with two tables >> library(arrow) >> library(duckdb) >> >> # csv file successfully registed as a table >> con <- dbConnect(duckdb()) >> duckdb_read_csv(con, "airports", "data/airports.csv") >> dbListTables(con) >> [1] "airports" >> >> # flights_arrow does not show up as a table >> flights_arrow <- open_csv_dataset("data/flights") >> duckdb_register_arrow(con, "flights", flights_arrow) >> dbListTables(con) >> [1] "airports" >> dbGetQuery(con, "SELECT table_name FROM information_schema.tables;") >> table_name >> 1 airports >> >> # SQL queries can be run on flights table >> dbGetQuery(con, "SELECT * FROM flights LIMIT 2;") >> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time >> arr_delay carrier >> 1 2013 1 1 517 515 2 830 >> 819 11 UA >> 2 2013 1 1 554 558 -4 740 >> 728 12 UA >> flight tailnum origin dest air_time distance hour minute >> time_hour >> 1 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 >> 10:00:00 >> 2 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 >> 10:00:00 >> >> ______________________________________________ >> [email protected] mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> https://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > > -- > John Kane > Kingston ON Canada > > [[alternative HTML version deleted]] > > ______________________________________________ > [email protected] mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > https://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ [email protected] mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide https://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

