On 01/12/2011 17:01, "Gabor Grothendieck" <ggrothendi...@gmail.com> wrote:
>On Thu, Dec 1, 2011 at 10:02 AM, Berry, David I. <d...@noc.ac.uk> wrote: >> Hi List >> >> Apologies if this isn't the correct place for this query (I've tried a >>search of the mail archives but not had much joy). >> >> I'm running R (2.14.0) on a Mac (OSX v 10.5.8, 2.66GHz, 4GB memory) >>and am having a few performance issues with reading data in from a >>Postres database (using RPostgreSQL). My query / code are as below >> >> # ----------------------------- >> library('RPostgreSQL') >> >> drv <- dbDriver("PostgreSQL") >> >> dbh <- dbConnect(drv,user="Š",password="Š",dbname="Š",host="Š") >> >> sql <- "select id, date, lon, lat, date_trunc('day' , date) as jday, >>extract('hour' from date) as hour, extract('year' from date) as year >>from observations where pt = 6 and date >= '1990-01-01' and date < >>'1995-01-01' and lon > 180 and lon < 290 and lat > -30 and lat < 30 and >>sst is not null" >> >> dataIn <- dbGetQuery(dbh,sql) > >If this is a large table of which the desired rows are a small >fraction of all rows then be sure there indexes on the variables in >your where clause. > >You can also try it with the RpgSQL driver although there is no reason >to think that that would be faster. > >-- >Statistics & Software Consulting >GKX Group, GKX Associates Inc. >tel: 1-877-GKX-GROUP >email: ggrothendieck at gmail.com Thanks for the reply and suggestions. I've tried the RpgSQL drivers and the results are pretty similar in terms of performance. The ~1.5M records I'm trying to read into R are being extracted from a table with ~300M rows (and ~60 columns) that has been indexed on the relevant columns and horizontally partitioned (with constraint checking on). I do need to try and optimize the database a bit more but I don¹t think this is the cause of the performance issues. As an example, when I run the query purely in R it takes 273s to run (using system.time() to time it). When I extract the data via psql and system() and then import it into R using read.table() it takes 32s. The code I've used for both are below. The second way of doing it (psql and read.table()) is less than ideal but does seem to have a big performance advantage at the moment the only difference in the results is that the date variables are stored as strings in the second example. # Query purely in R # ------------------------ dbh <- dbConnect(drv,user="Š",password="Š", dbname="Š",host="Š") sql <- "select id, date, lon, lat, date_trunc('day' , date) as jday, extract('hour' from date) as hour, extract('year' from date) as year from observations where pt = 6 and date >= '1990-01-01' and date < '1995-01-01' and lon > 180 and lon < 290 and lat > -30 and lat < 30 and sst is not null;" dataIn <- dbGetQuery(dbh,sql) # Query via command line # ---------------------------------- system('psql h myhost d mydb U myuid f getData.sql') system('cat tmp.csv | sed 's/^,/""&/g;s/^[0-9a-zA-Z]\+/"&"/g' > tmp2.csv') # This just ensures the first column is quoted dataIn <- read.table('tmp2.csv',sep=',' ,col.names=c( "id","date","lon","lat","jday","hour","year") ) # Contents of getData.sql # --------------------------------- \o ./tmp.csv \pset format unaligned \pset fieldsep ',' \pset tuples_only select id, date, lon, lat, date_trunc('day' , date) as jday, extract('hour' from date) as hour, extract('year' from date) as year from observations where pt = 6 and date >= '1990-01-01' and date < '1995-01-01' and lon > 180 and lon < 290 and lat > -30 and lat < 30 and sst is not null; \q ---------------------------------------------- David Berry National Oceanography Centre, UK -- This message (and any attachments) is for the recipient only. NERC is subject to the Freedom of Information Act 2000 and the contents of this email and any reply you make may be disclosed by NERC unless it is exempt from release under the Act. Any material supplied to NERC may be stored in an electronic records management system. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.