John,

200 GB is indeed a big amount of data as many machines have much less working 
memory albeit they may be able to save some virtual memory on disk if properly 
configured.

I have not used the tools you mention but wonder if your underlying data files 
change or are stable?

If they mostly do not change, one possible approach is to have your program 
save the list of filenames as well as the last changed date. When the program 
runs again, it can check for the existence of new files or absence of old files 
fairly quickly in memory. Perhaps it can then only deal with changed or new 
files and try to update their data to the DB carefully (avoiding duplicates) 
and thus speed up one part of your effort. Of course, this may not handle what 
to do with data that is gone or if your data allows redundant lines.

Ideally, whatever generates data should not be saving the files at all but deal 
directly with the DB. Or, if the CSV files contained a column specifying the 
date the data was added, you could use that to determine updates.

You said: 

" If all the data were in a few files, then in memory duckdb would work.  "

I wonder about that as 200GB is a huge amount. Do thousands of files take up 
more space than a few big ones?

Obviously, opening and closing lots of files is slower. Some operations do take 
a long time or use extra space so how you concatenate the data matters. On 
something like a LINUX system, there are fairly efficient ways to concatenate 
lots of files such as issuing a command within some folder that looks like:

   cat *.csv >subfolder/bigfile.csv

Again, if all share the same columns but have no header, this can run fairly 
quickly outside of R and then you read in bigfile.csv and later delete it. I 
note when not in use, files like that can be kept compressed and some methods 
even allow reading them in a compressed format, at some expense. If all you 
need is a random sample, I can imagine ways to read in less data.

And, ask yourself if the data in memory can be compressed in another way. As an 
example, R supports a data structure called factors for some kinds of data. If 
you have a column that stores something like the name of the US State, making 
it a factor as it is read in may require storing North Carolina as number 22 
and so on. TRUE/FALSE in a column might be replaced with a Boolean value of 
0/1.  There are other such techniques. This could be done carefully so the 
final in-memory data structure is smaller, and I noted earlier what happens if 
you simply remove many columns, and perhaps rows, as you read the data in, and 
remove temporary variables as soon as possible. Lots of people end up with many 
variations on their data remaining in memory and overwhelming any machine.

I am no expert, but perhaps any future such work would do better being designed 
up-front in ways that place the burden more on a data-base designed to handle 
large amounts of data rather than files on your machine. Going forward, many 
projects involve constantly adding data to the point where it has to be 
distributed across a cloud of machines in order to work at all.



-----Original Message-----
From: R-help <[email protected]> On Behalf Of Naresh Gurbuxani
Sent: Sunday, May 24, 2026 8:20 PM
To: John Kane <[email protected]>
Cc: [email protected]
Subject: Re: [R] duckdb table from multiple csv files

Files are on a local network drive.

I ended up creating a duckdb database and writing all the data into a couple of 
tables.  Database is approximately 200 GB.

Initially I was directly reading these files one at a time, doing the analysis, 
keeping analysis results, then moving to next file.  Going through all the 
files took a few hours.  Then, if I wanted to tweak the analysis, I needed to 
start over.

I am looking for tools to get faster access to data files, preferably without 
resaving data.  Some analysis requires a small subset of data.  If all the data 
were in a few files, then in memory duckdb would work.  There would be no need 
to resave data.  But with so many files, writing data into duckdb database was 
needed.

My analysis is mostly complete.  For next time, I want to see if arrow + duckdb 
will help avoid resaving data in another format.

Sent from my iPhone

On May 24, 2026, at 5:41 PM, John Kane <[email protected]> wrote:


I am not really sure what you are doing here.

Where are the files stored? Are they in one place?
What size are they?

On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani 
<[email protected]<mailto:[email protected]>> wrote:

I have approximately ten thousand csv files with identical columns and
formats.  I want to run some SQL queries on a virtual database, where all
of these files are treated as one table.  While it is possible to run
SQL query, dbListTables() does not show this table.  Is it possible to
list all tables including those created from arrow FileSystem?

Is it possible to achieve this result without arrow package?

# Create example data
library(data.table)
data("flights", package = "nycflights13")
fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv")
fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv")
fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv")

data("airports", package = "nycflights13")
fwrite(airports, "data/airports.csv")

# Verify data saved as intended
dir("data")
[1] "airports.csv" "flights"
dir("data/flights/")
[1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"

# Create virtual database with two tables
library(arrow)
library(duckdb)

# csv file successfully registed as a table
con <- dbConnect(duckdb())
duckdb_read_csv(con, "airports", "data/airports.csv")
dbListTables(con)
[1] "airports"

# flights_arrow does not show up as a table
flights_arrow <- open_csv_dataset("data/flights")
duckdb_register_arrow(con, "flights", flights_arrow)
dbListTables(con)
[1] "airports"
dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
  table_name
1   airports

# SQL queries can be run on flights table
dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 
arr_delay carrier
1 2013     1   1      517            515         2      830            819      
  11      UA
2 2013     1   1      554            558        -4      740            728      
  12      UA
  flight tailnum origin dest air_time distance hour minute           time_hour
1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01 10:00:00
2   1696  N39463    EWR  ORD      150      719    5     58 2013-01-01 10:00:00

______________________________________________
[email protected]<mailto:[email protected]> mailing list -- To 
UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
John Kane
Kingston ON Canada

        [[alternative HTML version deleted]]

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to