Re: [R] duckdb table from multiple csv files

avi.e.gross Sun, 24 May 2026 15:33:49 -0700

Although there is some code there, we may want to step back and ask what the
person wants to do.

Assuming you have thousands of files in CSV format and you can easily access
them one by one, then there are two major methods that seem interesting if
you want to evaluate them in some way.

You can open all the files and combine them into one entity you can query.

Alternatively, you can create a query that you then do repeatedly on as many
(or all) of the files till you get a (combined) result.

The query can use native R mechanisms, or a package that does SQL if you put
the data into a database, or another package such as in the tidyverse using
dplyr and so on.

In real life, you have to ask if the many CSV files contain valid data
without duplications or missing data and so on. If it ios not guaranteed to
be exactly what you want, then consolidating the data and running multiple
functions on it to make it what you want may be important. Removing (or
maybe fixing) some rows may make sense. It may make sense to remove columns
not needed for the query or rows that fail some condition that simplifies
later queries. It may make sense to choose subsets, remove duplicates on the
columns remaining and much more. As one of many examples, you might
determine info about some column and then remove outliers above and below a
threshold.

If you insist on using SQL, as the final query, you could push all the
remaining data into a database.         

If your data is too large when combined, there are packages that can help
you work with them albeit this is an example where pushing it all into a
database first makes sense, albeit it can be slow.

And note, your solution can include work outside of R such as copying all
the files (assuming n o header in each) into one file and then continuing
from R.

Good luck.

-----Original Message-----
From: R-help <[email protected]> On Behalf Of John Kane
Sent: Sunday, May 24, 2026 5:41 PM
To: Naresh Gurbuxani <[email protected]>
Cc: [email protected]
Subject: Re: [R] duckdb table from multiple csv files

I am not really sure what you are doing here.

Where are the files stored? Are they in one place?
What size are they?

On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani
<[email protected]>
wrote:

>
> I have approximately ten thousand csv files with identical columns and
> formats.  I want to run some SQL queries on a virtual database, where all
> of these files are treated as one table.  While it is possible to run
> SQL query, dbListTables() does not show this table.  Is it possible to
> list all tables including those created from arrow FileSystem?
>
> Is it possible to achieve this result without arrow package?
>
> # Create example data
> library(data.table)
> data("flights", package = "nycflights13")
> fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv")
> fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv")
> fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv")
>
> data("airports", package = "nycflights13")
> fwrite(airports, "data/airports.csv")
>
> # Verify data saved as intended
> dir("data")
> [1] "airports.csv" "flights"
> dir("data/flights/")
> [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"
>
> # Create virtual database with two tables
> library(arrow)
> library(duckdb)
>
> # csv file successfully registed as a table
> con <- dbConnect(duckdb())
> duckdb_read_csv(con, "airports", "data/airports.csv")
> dbListTables(con)
> [1] "airports"
>
> # flights_arrow does not show up as a table
> flights_arrow <- open_csv_dataset("data/flights")
> duckdb_register_arrow(con, "flights", flights_arrow)
> dbListTables(con)
> [1] "airports"
> dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
>   table_name
> 1   airports
>
> # SQL queries can be run on flights table
> dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
>   year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
> arr_delay carrier
> 1 2013     1   1      517            515         2      830
> 819        11      UA
> 2 2013     1   1      554            558        -4      740
> 728        12      UA
>   flight tailnum origin dest air_time distance hour minute
>  time_hour
> 1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01
> 10:00:00
> 2   1696  N39463    EWR  ORD      150      719    5     58 2013-01-01
> 10:00:00
>
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

-- 
John Kane
Kingston ON Canada

        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] duckdb table from multiple csv files

Reply via email to