Naresh,

There are times the choice of tools constrains you.

Consider what would happen if you chose a language with features readily and 
cheaply available that allow some kinds of concurrent programming. Python 
allows some of that in a somewhat interpreted way and plenty of more compiled 
languages may do it even more efficiently. I am talking about the ability to 
have a program hand out work to multiple other threads on the same machine or 
even distribute the work across a network of machines.

In your case, consider how some search engines do a search across a wide swath 
of the internet. Loosely, a master program sends requests to many others and 
gets back results and in some sense merges them. Another example is a merge 
sort where at every step, one function hands half of the remaining data to each 
of two others to sort  then waits for a result and merges the two returned 
streams together. This may happen using thousands of processes eventually and 
may return a fasgter result than many other sort methods.

In your case, you could have a queue of N simultaneous threads that each 
process one file for you and return results. As soon as one finishes, another 
thread starts to do the next file. The threads will likely interweave when they 
are doing slow things like I/O and may run on different processors and the 
interleaving can be substantially faster than one process running and spending 
much of the time waiting for I/O and so on. 

I don't know if R currently supports some of this well. I have seen Python used 
this way and many modules are available that support things like this. Of 
course, you might have a learning process involved as you might need to use 
add-on modules like numpy and pandas rather than what is largely built-in using 
R and its many packages. R does have some packages that may let you do this, 
such as "parallel" but, again, I suspect a learning curve to get it right.

I think I can sign off now as I suspect much of my thoughts may not be of help. 
😉

Avi


-----Original Message-----
From: Naresh Gurbuxani <[email protected]> 
Sent: Sunday, May 24, 2026 8:26 PM
To: [email protected]
Cc: John Kane <[email protected]>; [email protected]
Subject: Re: [R] duckdb table from multiple csv files

I have done most of the analysis using tools in base R: read one file, do 
analysis, keep results, move to next file.  Each cycle is several hours.  
Making small changes in analysis is costly.

I want to explore methods that will provide faster access to data, but do not 
require resaving data.


Sent from my iPhone

> On May 24, 2026, at 6:33 PM, [email protected] wrote:
> 
> Although there is some code there, we may want to step back and ask what the
> person wants to do.
> 
> Assuming you have thousands of files in CSV format and you can easily access
> them one by one, then there are two major methods that seem interesting if
> you want to evaluate them in some way.
> 
> You can open all the files and combine them into one entity you can query.
> 
> Alternatively, you can create a query that you then do repeatedly on as many
> (or all) of the files till you get a (combined) result.
> 
> The query can use native R mechanisms, or a package that does SQL if you put
> the data into a database, or another package such as in the tidyverse using
> dplyr and so on.
> 
> In real life, you have to ask if the many CSV files contain valid data
> without duplications or missing data and so on. If it ios not guaranteed to
> be exactly what you want, then consolidating the data and running multiple
> functions on it to make it what you want may be important. Removing (or
> maybe fixing) some rows may make sense. It may make sense to remove columns
> not needed for the query or rows that fail some condition that simplifies
> later queries. It may make sense to choose subsets, remove duplicates on the
> columns remaining and much more. As one of many examples, you might
> determine info about some column and then remove outliers above and below a
> threshold.
> 
> If you insist on using SQL, as the final query, you could push all the
> remaining data into a database.    
> 
> If your data is too large when combined, there are packages that can help
> you work with them albeit this is an example where pushing it all into a
> database first makes sense, albeit it can be slow.
> 
> And note, your solution can include work outside of R such as copying all
> the files (assuming n o header in each) into one file and then continuing
> from R.
> 
> Good luck.
> 
> 
> -----Original Message-----
> From: R-help <[email protected]> On Behalf Of John Kane
> Sent: Sunday, May 24, 2026 5:41 PM
> To: Naresh Gurbuxani <[email protected]>
> Cc: [email protected]
> Subject: Re: [R] duckdb table from multiple csv files
> 
> I am not really sure what you are doing here.
> 
> Where are the files stored? Are they in one place?
> What size are they?
> 
>> On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani
>> <[email protected]>
>> wrote:
>> 
>> 
>> I have approximately ten thousand csv files with identical columns and
>> formats.  I want to run some SQL queries on a virtual database, where all
>> of these files are treated as one table.  While it is possible to run
>> SQL query, dbListTables() does not show this table.  Is it possible to
>> list all tables including those created from arrow FileSystem?
>> 
>> Is it possible to achieve this result without arrow package?
>> 
>> # Create example data
>> library(data.table)
>> data("flights", package = "nycflights13")
>> fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv")
>> fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv")
>> fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv")
>> 
>> data("airports", package = "nycflights13")
>> fwrite(airports, "data/airports.csv")
>> 
>> # Verify data saved as intended
>> dir("data")
>> [1] "airports.csv" "flights"
>> dir("data/flights/")
>> [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"
>> 
>> # Create virtual database with two tables
>> library(arrow)
>> library(duckdb)
>> 
>> # csv file successfully registed as a table
>> con <- dbConnect(duckdb())
>> duckdb_read_csv(con, "airports", "data/airports.csv")
>> dbListTables(con)
>> [1] "airports"
>> 
>> # flights_arrow does not show up as a table
>> flights_arrow <- open_csv_dataset("data/flights")
>> duckdb_register_arrow(con, "flights", flights_arrow)
>> dbListTables(con)
>> [1] "airports"
>> dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
>>  table_name
>> 1   airports
>> 
>> # SQL queries can be run on flights table
>> dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
>>  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
>> arr_delay carrier
>> 1 2013     1   1      517            515         2      830
>> 819        11      UA
>> 2 2013     1   1      554            558        -4      740
>> 728        12      UA
>>  flight tailnum origin dest air_time distance hour minute
>> time_hour
>> 1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01
>> 10:00:00
>> 2   1696  N39463    EWR  ORD      150      719    5     58 2013-01-01
>> 10:00:00
>> 
>> ______________________________________________
>> [email protected] mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> https://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>> 
> 
> 
> --
> John Kane
> Kingston ON Canada
> 
>    [[alternative HTML version deleted]]
> 
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to