Group: I am a thesis committee member for a Geography M.S. student whose 
thesis involves modeling tree species ranges in Minnesota. His question is 
what percentage of the nearly 700,000 grid cells would be appropriate to use 
as a random sample. Does anyone have thoughts? You could respond to me or to 
Daryn Hardwick (below).

Thanks, Bill Cook

William M. Cook, Ph.D.
Associate Professor
Department of Biological Sciences
Saint Cloud State University
Email: [email protected]

From: Hardwick, Daryn R. [[email protected]]
Subject: Thesis Issue - Advice Needed

Hello,

I have run into a small issue with my thesis.  I have a grid of variables 
consisting of 685,152 cells covering Minnesota, which I am using to determine 
which variables play the biggest role in a tree species' current range.  As I 
stated in my proposal, I was going to attempt this in R using the random 
forests algorithm.  However, this is WAY too much data for R to process.  I 
will therefore be creating a random sample to use in the algorithm.
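For concreteness, the workflow I have in mind looks roughly like this (the 
file name, the 'present' response column, and the predictor names are 
placeholders for my actual data):

    library(randomForest)

    grid <- read.csv("mn_grid.csv")        # 685,152 cells with predictor columns

    set.seed(42)                           # make the sample reproducible
    n <- round(0.01 * nrow(grid))          # a 1% sample, about 6,852 cells
    samp <- grid[sample(nrow(grid), n), ]

    # Fit random forests on the sample; 'present' is presence/absence of the
    # species and the remaining columns are the environmental variables.
    rf <- randomForest(factor(present) ~ ., data = samp,
                       ntree = 500, importance = TRUE)
    varImpPlot(rf)                         # ranks variables by importance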

I need advice on what percentage of the original data I should use for the 
random sample.  I can't find any professional source that recommends a 
specific percentage.  Even with 10,000 records, the output of the algorithm 
exceeds 1 GB in file size (not to mention the processing time).  Would a 1% 
sample (around 7,000 records) be reasonable?  That is larger than many random 
surveys (e.g., presidential approval polls), and I would of course run some 
statistical checks to confirm that the sample is truly representative of the 
full grid.
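As an example of the kind of check I mean, I could compare each predictor's 
distribution in the sample against its distribution in the full grid using a 
Kolmogorov-Smirnov test (again, 'elev' is a placeholder column name, and 
'grid' and 'samp' are the objects from the sketch above):

    # Compare one predictor's distribution in the sample vs. the full grid;
    # a small p-value would flag that variable as poorly represented.
    ks.test(samp$elev, grid$elev)

    # The same check over every numeric predictor at once:
    num <- names(grid)[sapply(grid, is.numeric)]
    sapply(num, function(v) ks.test(samp[[v]], grid[[v]])$p.value)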

If you could just let me know what you think so I can proceed, I would greatly 
appreciate it!

Daryn Hardwick
Graduate Assistant, St. Cloud State University
Department of Geography and Planning
