Sent from my iPhone

> On May 24, 2026, at 9:52 PM, [email protected] wrote:
> 
> John,
> 
> 200 GB is indeed a big amount of data as many machines have much less working 
> memory albeit they may be able to save some virtual memory on disk if 
> properly configured.
> 
> I have not used the tools you mention but wonder if your underlying data 
> files change or are stable?
> 
> If they mostly do not change, one possible approach is to have your program 
> save the list of filenames as well as the last changed date. When the program 
> runs again, it can check for the existence of new files or absence of old 
> files fairly quickly in memory. Perhaps it can then only deal with changed or 
> new files and try to update their data to the DB carefully (avoiding 
> duplicates) and thus speed up one part of your effort. Of course, this may 
> not handle what to do with data that is gone or if your data allows redundant 
> lines.
> 
> Ideally, whatever generates data should not be saving the files at all but 
> deal directly with the DB. Or, if the CSV files contained a column specifying 
> the date the data was added, you could use that to determine updates.
> 
> You said:
> 
> " If all the data were in a few files, then in memory duckdb would work.  "
> 
I only need a subset of data at any time.  Duckdb allows a virtual table for 
each file.  This not practical with thousands of files.  With a few large 
files, this can work.  Here the goal is to establish a connection, not to load 
all data at once.  

With arrow, it is possible to designate a directory as virtual table.  
> I wonder about that as 200GB is a huge amount. Do thousands of files take up 
> more space than a few big ones?
> 
> Obviously, opening and closing lots of files is slower. Some operations do 
> take a long time or use extra space so how you concatenate the data matters. 
> On something like a LINUX system, there are fairly efficient ways to 
> concatenate lots of files such as issuing a command within some folder that 
> looks like:
> 
>   cat *.csv >subfolder/bigfile.csv
> 
> Again, if all share the same columns but have no header, this can run fairly 
> quickly outside of R and then you read in bigfile.csv and later delete it. I 
> note when not in use, files like that can be kept compressed and some methods 
> even allow reading them in a compressed format, at some expense. If all you 
> need is a random sample, I can imagine ways to read in less data.
> 
> And, ask yourself if the data in memory can be compressed in another way. As 
> an example, R supports a data structure called factors for some kinds of 
> data. If you have a column that stores something like the name of the US 
> State, making it a factor as it is read in may require storing North Carolina 
> as number 22 and so on. TRUE/FALSE in a column might be replaced with a 
> Boolean value of 0/1.  There are other such techniques. This could be done 
> carefully so the final in-memory data structure is smaller, and I noted 
> earlier what happens if you simply remove many columns, and perhaps rows, as 
> you read the data in, and remove temporary variables as soon as possible. 
> Lots of people end up with many variations on their data remaining in memory 
> and overwhelming any machine.
> 
> I am no expert, but perhaps any future such work would do better being 
> designed up-front in ways that place the burden more on a data-base designed 
> to handle large amounts of data rather than files on your machine. Going 
> forward, many projects involve constantly adding data to the point where it 
> has to be distributed across a cloud of machines in order to work at all.
> 
> 
> 
> -----Original Message-----
> From: R-help <[email protected]> On Behalf Of Naresh Gurbuxani
> Sent: Sunday, May 24, 2026 8:20 PM
> To: John Kane <[email protected]>
> Cc: [email protected]
> Subject: Re: [R] duckdb table from multiple csv files
> 
> Files are on a local network drive.
> 
> I ended up creating a duckdb database and writing all the data into a couple 
> of tables.  Database is approximately 200 GB.
> 
> Initially I was directly reading these files one at a time, doing the 
> analysis, keeping analysis results, then moving to next file.  Going through 
> all the files took a few hours.  Then, if I wanted to tweak the analysis, I 
> needed to start over.
> 
> I am looking for tools to get faster access to data files, preferably without 
> resaving data.  Some analysis requires a small subset of data.  If all the 
> data were in a few files, then in memory duckdb would work.  There would be 
> no need to resave data.  But with so many files, writing data into duckdb 
> database was needed.
> 
> My analysis is mostly complete.  For next time, I want to see if arrow + 
> duckdb will help avoid resaving data in another format.
> 
> Sent from my iPhone
> 
> On May 24, 2026, at 5:41 PM, John Kane <[email protected]> wrote:
> 
> 
> I am not really sure what you are doing here.
> 
> Where are the files stored? Are they in one place?
> What size are they?
> 
> On Sun, 24 May 2026 at 09:35, Naresh Gurbuxani 
> <[email protected]<mailto:[email protected]>> wrote:
> 
> I have approximately ten thousand csv files with identical columns and
> formats.  I want to run some SQL queries on a virtual database, where all
> of these files are treated as one table.  While it is possible to run
> SQL query, dbListTables() does not show this table.  Is it possible to
> list all tables including those created from arrow FileSystem?
> 
> Is it possible to achieve this result without arrow package?
> 
> # Create example data
> library(data.table)
> data("flights", package = "nycflights13")
> fwrite(flights[(origin == "EWR")], "data/flights/ewr_flights.csv")
> fwrite(flights[(origin == "JFK")], "data/flights/jfk_flights.csv")
> fwrite(flights[(origin == "LGA")], "data/flights/lga_flights.csv")
> 
> data("airports", package = "nycflights13")
> fwrite(airports, "data/airports.csv")
> 
> # Verify data saved as intended
> dir("data")
> [1] "airports.csv" "flights"
> dir("data/flights/")
> [1] "ewr_flights.csv" "jfk_flights.csv" "lga_flights.csv"
> 
> # Create virtual database with two tables
> library(arrow)
> library(duckdb)
> 
> # csv file successfully registed as a table
> con <- dbConnect(duckdb())
> duckdb_read_csv(con, "airports", "data/airports.csv")
> dbListTables(con)
> [1] "airports"
> 
> # flights_arrow does not show up as a table
> flights_arrow <- open_csv_dataset("data/flights")
> duckdb_register_arrow(con, "flights", flights_arrow)
> dbListTables(con)
> [1] "airports"
> dbGetQuery(con, "SELECT table_name FROM information_schema.tables;")
>  table_name
> 1   airports
> 
> # SQL queries can be run on flights table
> dbGetQuery(con, "SELECT * FROM flights LIMIT 2;")
>  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time 
> arr_delay carrier
> 1 2013     1   1      517            515         2      830            819    
>     11      UA
> 2 2013     1   1      554            558        -4      740            728    
>     12      UA
>  flight tailnum origin dest air_time distance hour minute           time_hour
> 1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01 10:00:00
> 2   1696  N39463    EWR  ORD      150      719    5     58 2013-01-01 10:00:00
> 
> ______________________________________________
> [email protected]<mailto:[email protected]> mailing list -- To 
> UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
> --
> John Kane
> Kingston ON Canada
> 
>    [[alternative HTML version deleted]]
> 
> ______________________________________________
> [email protected] mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
______________________________________________
[email protected] mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to