Hi all,
I wonder if you can help me.
THE PROBLEM
I want to train and test a GLM on some large datasets. I am running into
problems when I flatten my tables into one to feed into the GLM, as this
produces a table far too big for the memory on my computer.
THREE TABLES - Pipes, Weekly Weather data, Bursts
I have three tables, which are all related to each other:
(1) Pipe cohorts (114,000 rows), with a range of explanatory variables.
    Linking fields: (A) pipe cohort ID, (B) weathercell_ID.
(2) Weekly weather data covering 12 years (i.e. 624 weeks for each pipe cohort).
    Linking fields: (C) week, (B) weathercell_ID.
(3) Bursts (40,000 bursts).
    Linking fields: (A) pipe cohort ID, (C) week.
Effectively, the combination of tables (1) and (2) makes up the population;
table (3) holds the events, or failures.
JOINING THE THREE TABLES
I have previously had far fewer pipe cohort rows.
What I have been doing until now is joining the (1) pipe cohort data to the
(2) weekly weather data.
This repeats each pipe cohort row once per week over the 12 years, which now
makes a very long table: 624 x 114,000 = roughly 71 million rows.
I would then join the (3) burst data onto that, to see how many bursts there
were in each week, on each pipe cohort.
This gave me one large, flat file which I could feed into the GLM.
This worked fine when there were not so many pipe cohorts, but now that there
are 114,000 rows, joining the tables produces a MASSIVE table (many, many GB)
which exhausts my computer's memory.
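In R terms, what I have been doing is roughly this (the table and column
names, and the tiny toy data, are simplified placeholders for my real
tables):

```r
## Toy stand-ins for the three tables (names and columns are placeholders)
pipes   <- data.frame(cohort_id = 1:2, weathercell_id = c(10, 10),
                      age = c(30, 55))
weather <- data.frame(weathercell_id = 10, week = 1:3,
                      rainfall = c(5, 0, 12))
bursts  <- data.frame(cohort_id = 1, week = 2, n_bursts = 1)

## (1) x (2): repeats each pipe cohort row once per week -- this is the
## step that blows the table up 624-fold on the real data
pop <- merge(pipes, weather, by = "weathercell_id")

## Left-join (3) so pipe-weeks with no burst are kept, then recode NA -> 0
pop <- merge(pop, bursts, by = c("cohort_id", "week"), all.x = TRUE)
pop$n_bursts[is.na(pop$n_bursts)] <- 0
```

On the toy data this gives 2 cohorts x 3 weeks = 6 rows; on the real data it
is the 71-million-row table that kills the machine.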
RELATIONAL DATABASE APPROACH?
I am thinking it would be better to have a relational database structure
where, for each data point (row) being fed into the biglm model, it takes the
three tables, looks up the appropriate values each time using the defined
join fields (A, B and C), feeds that row into the model, then goes back and
looks up the next point.
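Something like the following is what I have in mind, using DBI/RSQLite and
biglm's chunk-reader interface (bigglm.function). All table, column and
variable names below are made-up placeholders, the toy data just stand in
for my real tables, and I have assumed a Poisson family for burst counts:

```r
library(DBI)       # assumes the DBI, RSQLite and biglm packages are installed
library(RSQLite)
library(biglm)

## Toy stand-ins for the three tables (all names/columns invented);
## with real data this would be a file-backed database, not ":memory:"
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "pipes",
             data.frame(cohort_id = 1:50, weathercell_id = rep(1:5, 10),
                        age = runif(50, 0, 80)))
dbWriteTable(con, "weather",
             data.frame(weathercell_id = rep(1:5, each = 20),
                        week = rep(1:20, 5), rainfall = rpois(100, 4)))
dbWriteTable(con, "bursts",
             data.frame(cohort_id = c(3, 17), week = c(5, 12), n_bursts = 1))

## Let SQLite do the three-way join, so the flattened table never has to
## exist in R's memory all at once
sql <- "SELECT p.age, w.rainfall, COALESCE(b.n_bursts, 0) AS n_bursts
          FROM pipes p
          JOIN weather w ON w.weathercell_id = p.weathercell_id
          LEFT JOIN bursts b ON b.cohort_id = p.cohort_id
                            AND b.week = w.week"

## bigglm() makes several passes over the data (IRLS), so it wants a
## chunk-reader function it can reset, not a one-shot stream
make_reader <- function(con, sql, chunksize = 1e5) {
  res <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(res)) dbClearResult(res)
      res <<- dbSendQuery(con, sql)
      return(NULL)
    }
    chunk <- dbFetch(res, n = chunksize)
    if (nrow(chunk) == 0) NULL else chunk
  }
}

fit <- bigglm(n_bursts ~ age + rainfall,   # formula is a placeholder
              data = make_reader(con, sql), family = poisson())
dbDisconnect(con)
```

The idea is that only one chunk of the joined table is ever held in memory
at a time.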
ADVICE?
How would you approach this problem?
I have the data prepared in the three tables.
I need to fit lots of models to see which variables give me the best AIC
(output: lots of model fits).
Then I want to predict bursts using the best model and the available (1) pipe
and (2) weather data.
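For the model-selection step, I imagine a loop along these lines (the
formulas and toy data are invented placeholders, just to show the shape;
with the real data each fit would stream from the database instead):

```r
library(biglm)   # assumes the biglm package is installed

## Toy data standing in for the flattened population (names invented)
set.seed(1)
dat <- data.frame(age = runif(200, 0, 80), rainfall = rpois(200, 4))
dat$n_bursts <- rpois(200, exp(-3 + 0.02 * dat$age))

## Candidate right-hand sides, compared by AIC
candidates <- list(n_bursts ~ age,
                   n_bursts ~ rainfall,
                   n_bursts ~ age + rainfall)
aics <- sapply(candidates,
               function(f) AIC(bigglm(f, data = dat, family = poisson())))
best <- candidates[[which.min(aics)]]
```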
Would you use the biglm package, linking to an SQLite database? (Or do
something completely different?)
Many thanks,
Tim
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.