Hi Gabor,

Thank you for your willingness to help me through this.  The code you sent
works on my machine exactly the same way as it does on yours.
Unfortunately, when I run the same code on the 1.3GB file, it creates the
table structure but doesn't read in a single row [confirmed with
sqldf("select * from mytab", dbname = "mydb")].  Though I don't expect anyone to
download it, the file I'm using is ss09pusa.csv from
http://www2.census.gov/acs2009_1yr/pums/csv_pus.zip.  I tested both sets of
code on my work desktop and personal laptop, so it's not machine-specific
(although it might be Windows- or 64 bit-specific).

Do you have any other ideas as to how I might diagnose what's going on
here?  Or, alternatively, is there some workaround that would get this giant
CSV into a database?  If you think there's a reasonable way to use the
IMPORT command with RSQLite, that seems like it would import the fastest,
but I don't know that it's compatible with DBI on Windows.
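In case it helps clarify what I mean, this is the kind of workaround I had in
mind: the .import command of the sqlite3 command-line shell, which bypasses R
(and DBI) entirely.  A rough sketch -- untested on my end, with the file and
database names as placeholders, and assuming the sqlite3 shell is on the PATH
and that in csv mode a recent sqlite3 creates the table and takes the first
row of the file as the column names:

```shell
# Sketch only: "mydb" and the CSV path are placeholders for my setup.
# In .mode csv, recent sqlite3 shells create the table and use the
# first row of the CSV as column names if the table does not yet exist.
sqlite3 mydb <<'EOF'
.mode csv
.import ss09pusa.csv mytab
EOF

# Confirm the row count without going through R at all
sqlite3 mydb "select count(*) from mytab;"
```

If that worked, the resulting mydb file should presumably be queryable in
chunks from R with sqldf(..., dbname = "mydb"), just as in your example.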

Thanks again!
Anthony


> read.csv.sql("R:\\American Community Survey\\Data\\2009\\ss09pusa.csv",
sql = "create table mytab as select * from file", dbname = "mydb")
NULL
Warning message:
closing unused connection 3 (R:\American Community
Survey\Data\2009\ss09pusa.csv)
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
+        s <- sprintf("select * from mytab limit %d, %d", i, k)
+        print(sqldf(s, dbname = "mydb"))
+ }
Error in seq.default(0, N - 1, k) : wrong sign in 'by' argument
> N
[1] 0

On Mon, Nov 15, 2010 at 12:24 PM, Gabor Grothendieck <
ggrothendi...@gmail.com> wrote:

> On Mon, Nov 15, 2010 at 11:46 AM, Anthony Damico <ajdam...@gmail.com>
> wrote:
> > Hi Gabor,
> >
> > Thank you for the prompt reply.  I definitely looked over all of the
> > examples on the code.google.com sqldf page before sending, which is
> > why I wrote the code
> >
> > read.csv.sql("ss09pusa.csv" , sql="create table ss09pusa as select *
> > from file" , dbname="sqlite")
> >
> > directly pulled from their code -- read.csv.sql("~/tmp.csv", sql =
> > "create table mytab as select * from file", dbname = "mydb")
> >
> > ...but I don't understand why this helps me around the memory problem,
> > since I think it still all gets read into memory.  Is there a way to
> > do this line by line?
>
>
> OK. Maybe it's something else.
>
> Reading the file into the database should not be a resource problem,
> provided you have enough disk space and appropriate permissions.
> sqldf / RSQLite get sqlite itself to do the import, so the data never
> passes through R at that stage and R's memory limitations cannot
> affect the read into the sqlite database.  R's limitations only come
> into effect when you read the data back out of the sqlite database,
> so you just have to be sure not to read too much in at a time.  The
> create table ... as select ... is there to prevent sqldf from
> deleting the table: sqldf is normally used in a fashion where you
> don't want to know about the back-end database, so it creates and
> deletes it behind the scenes, but here you want to use the database
> explicitly, so you have to work around that.
>
> Try this example. It should be reproducible so you just have to copy
> it and paste it into your R session.  Uncomment the indicated line if
> you want to be able to remove any pre-existing mydb file in the
> current directory.  Try it in a fresh R session just to be sure that
> nothing mucks it up.
>
> library(sqldf)
>
> # uncomment next line to make sure we are starting clean
> # if (file.exists("mydb")) file.remove("mydb")
>
> # create new database
> sqldf("attach 'mydb' as new")
>
> # create a new file.  BOD is built into R and has 6 rows.
> write.table(BOD, file = "tmp.csv", quote = FALSE, sep = ",")
>
> # read new file into database
> read.csv.sql("tmp.csv", sql = "create table mytab as select * from file",
>        dbname = "mydb")
>
> # how many records are in table?
> N <- sqldf("select count(*) from mytab", dbname = "mydb")[[1]]
>
> # read in chunks and display what we have read
>
> k <- 4 # no of records to read at once
> for(i in seq(0, N-1, k)) {
>        s <- sprintf("select * from mytab limit %d, %d", i, k)
>        print(sqldf(s, dbname = "mydb"))
> }
>
> On my machine I get this output:
>
>  Time demand
> 1    1    8.3
> 2    2   10.3
> 3    3   19.0
> 4    4   16.0
>  Time demand
> 1    5   15.6
> 2    7   19.8
>
> showing that it read the 6 line BOD data frame in chunks of 4 as required.
> --
> Statistics & Software Consulting
> GKX Group, GKX Associates Inc.
> tel: 1-877-GKX-GROUP
> email: ggrothendieck at gmail.com
>


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
