Richard Vlasimsky wrote:
Does anyone have any performance tuning tips when working with datasets that 
are extremely wide (e.g. 20,000 columns)?

In particular, I am trying to perform a merge like below:

merged_data <- merge(data1, data2, by.x="date", by.y="date", all=TRUE, sort=TRUE);

This statement takes about 8 hours to execute on a pretty fast machine. The dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25 columns. The dataset data2 contains annual data (only 60 observations), but lots of columns (20,000 of them). I have to do a lot of these kinds of merges, so I need to figure out a way to speed them up.
I have tried a number of different things to speed this up, to no avail.
I've noticed that rbind() executes much faster on matrices than on data frames.
However, the performance improvement from using matrices (vs. data frames) for
merges was negligible (8 hours down to 7). I tried casting my merge field
(date) to various data types (character, factor, Date); this didn't seem to
have any effect. I tried the hash package, but merge() couldn't coerce that
class into a data.frame. I've tried various ways to parallelize computation
in the past and found them problematic for a variety of reasons (runaway
forked processes, not running in a GUI environment, not running on Macs, etc.).

I'm starting to run out of ideas, anyone?  Merging a 60 row dataset shouldn't 
take that long.

Thanks,
Richard


Hi Richard,

I had similar problems (even with much less data) and found that most of the running time was due to memory swapping rather than CPU usage. If you do not need all of the merged data at once, block-wise processing can help: you generate only as much merged data at a time as fits into main memory. I ended up using the package RSQLite (an embedded database) in the following way:

- create a database connection (explained in the package docs)
- copy the data to database tables via dbWriteTable()
- create indices on the columns used for merging, sth. like "dbGetQuery(con, 'create index index_year on table2(year)')" -- this speeds up joining significantly
- construct an SQL query that does the join / merge operation and send it to SQLite via dbSendQuery()
- retrieve the result in blocks of reasonable size with fetch()
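The steps above can be sketched as follows. This is a minimal illustration with made-up toy tables: the table names "daily" and "annual", the column "year", and the tiny block size are assumptions for demonstration, not from the original post.

```r
## Minimal sketch of the block-wise RSQLite workflow described above.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # use a file path for real data

daily  <- data.frame(year = rep(1950:1951, each = 3), x = 1:6)  # many rows
annual <- data.frame(year = 1950:1951, y = c(10, 20))           # few rows

dbWriteTable(con, "daily",  daily)
dbWriteTable(con, "annual", annual)

## Index the join column -- this is what speeds up the join
dbExecute(con, "CREATE INDEX idx_annual_year ON annual(year)")

## Send the join/merge query; result rows are produced lazily by fetch()
res <- dbSendQuery(con,
  "SELECT d.year, d.x, a.y FROM daily d LEFT JOIN annual a ON d.year = a.year")

## Retrieve the result in blocks of a size that fits into memory
blocks <- list()
while (!dbHasCompleted(res)) {
  block <- dbFetch(res, n = 2)            # use a much larger n in practice
  blocks[[length(blocks) + 1L]] <- block  # process each block here instead
}
merged <- do.call(rbind, blocks)

dbClearResult(res)
dbDisconnect(con)
```

Note that current versions of the DBI package spell these generics dbFetch() and dbExecute(); older code used fetch() and dbGetQuery() as in the steps above.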

Unless the query contains an operation that requires SQLite to process the whole result (e.g. sorting), the result rows are created on the fly for each call of fetch(), instead of a huge table being allocated in addition to the original data.

I am not sure if this works with other database engines (there are a couple of database interfaces on CRAN); when I tried to use RPostgreSQL, it created the whole result set at once, leading to the same memory problem. Maybe that behavior can be changed by some config variable.

Best regards,

Andreas

--
Andreas Borg
Medizinische Informatik

UNIVERSITÄTSMEDIZIN
der Johannes Gutenberg-Universität
Institut für Medizinische Biometrie, Epidemiologie und Informatik
Obere Zahlbacher Straße 69, 55131 Mainz
www.imbei.uni-mainz.de

Telefon +49 (0) 6131 175062
E-Mail: b...@imbei.uni-mainz.de

This e-mail contains confidential and/or legally protected information. If you
are not the intended recipient or have received this e-mail in error, please
notify the sender immediately and delete this e-mail. Unauthorized copying and
unauthorized disclosure of this e-mail and the information it contains are not
permitted.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
