Hi Gabor,

Thank you for your willingness to help me through this.  The code you sent
works on my machine exactly the same way as it does on yours.
Unfortunately, when I run the same code on the 1.3GB file, it creates the
table structure but doesn't read in a single row [confirmed with
sqldf("select * from mytab", dbname = "mydb")].  Though I don't expect anyone to
download it, the file I'm using is ss09pusa.csv from
http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip.  I tested both sets of
code on my work desktop and personal laptop, so it's not machine-specific
(although it might be Windows- or 64 bit-specific).

Do you have any other ideas as to how I might diagnose what's going on
here?  Or, alternatively, is there some workaround that would get this giant
CSV into a database?  If you think there's a reasonable way to use the
IMPORT command with RSQLite, that seems like it would import the fastest,
but I don't know that it's compatible with DBI on Windows.
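In case it helps clarify what I mean, this is the kind of workaround I had in
mind: the .import command of the sqlite3 command-line shell, which bypasses R
(and DBI) entirely.  A rough sketch -- untested on my end, with the file and
database names as placeholders, and assuming the sqlite3 shell is on the PATH
and that in csv mode a recent sqlite3 creates the table and takes the first
row of the file as the column names:

```shell
# Sketch only: "mydb" and the CSV path are placeholders for my setup.
# In .mode csv, recent sqlite3 shells create the table and use the
# first row of the CSV as column names if the table does not yet exist.
sqlite3 mydb <<'EOF'
.mode csv
.import ss09pusa.csv mytab
EOF

# Confirm the row count without going through R at all
sqlite3 mydb "select count(*) from mytab;"
```

If that worked, the resulting mydb file should presumably be queryable in
chunks from R with sqldf(..., dbname = "mydb"), just as in your example.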

Thanks again!
Anthony


> read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
sql = "create table mytab as select * from file", dbname = "mydb")
NULL
Warning message:
closing unused connection 3 (R:\American Community
Survey\Data\2009\ss09pusa.csv)
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
+        s <- sprintf("select * from mytab limit %d, %d", i, k)
+        print(sqldf(s, dbname = "mydb"))
+ }
Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> N
[1] 0

On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck <
ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico <ajdam...@gmail.com>
> wrote:
> > Hi Gabor,
> >
> > Thank you for the prompt reply.  I definitely looked over all of the
> > examples on the code.google.com sqldf page before sending, which is
> > why I wrote the code
> >
> > read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select *
> > from file" , dbname="sqlite")
> >
> > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> > "create table mytab as select * from file", dbname = "mydb")
> >
> > ...but I don't understand why this helps me around the memory problem,
> > since I think it still all gets read into memory.  Is there a way to
> > do this line by line?
>
>
> OK. Maybe it's something else.
>
> Reading the file into the database should not be a resource problem,
> provided you have enough disk space and appropriate permissions.
> sqldf / RSQLite get sqlite itself to do the import, so the data never
> passes through R at that stage and R's memory limitations cannot
> affect the read into the sqlite database.  R's limitations only come
> into effect when you read the data back out of the sqlite database,
> so you just have to be sure not to read too much in at a time.  The
> create table ... as select ... is there to prevent sqldf from
> deleting the table: sqldf is normally used in a fashion where you
> don't want to know about the back-end database, so it creates and
> deletes it behind the scenes, but here you want to use the database
> explicitly, so you have to work around that.
>
> Try this example. It should be reproducible so you just have to copy
> it and paste it into your R session.  Uncomment the indicated line if
> you want to be able to remove any pre-existing mydb file in the
> current directory.  Try it in a fresh R session just to be sure that
> nothing mucks it up.
>
> library(sqldf)
>
> # uncomment next line to make sure we are starting clean
> # if (file.exists("mydb")) file.remove("mydb")
>
> # create new database
> sqldf("attach 'mydb' as new")
>
> # create a new file.  BOD is built into R and has 6 rows.
> write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")
>
> # read new file into database
> read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
>        dbname = "mydb")
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
>        s <- sprintf("select * from mytab limit %d, %d", i, k)
>        print(sqldf(s, dbname = "mydb"))
> }
>
> On my machine I get this output:
>
>  Time demand
> 1    1    8.3
> 2    2   10.3
> 3    3   19.0
> 4    4   16.0
>  Time demand
> 1    5   15.6
> 2    7   19.8
>
> showing that it read the 6 line BOD data frame in chunks of 4 as required.
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
