Group: I am a thesis committee member on a Geography M.S. thesis project that involves modeling tree species ranges in Minnesota. The student's question is what percentage of the roughly 700,000 grid cells would be appropriate to use as a random sample. Does anyone have thoughts? You could respond to me or to Daryn Hardwick (below).
Thanks,
Bill Cook

William M. Cook, Ph.D.
Associate Professor
Department of Biological Sciences
Saint Cloud State University
Email: [email protected]

From: Hardwick, Daryn R. [[email protected]]
Subject: Thesis Issue - Advice Needed

Hello,

I have run into a small issue with my thesis. I have a grid of variables consisting of 685,152 cells for Minnesota that I am using to determine which variables play the biggest role in a tree species' current range. As I stated in my proposal, I was going to attempt this in R using the random forests algorithm. However, this is WAY too much data for R to process, so I will be creating a random sample to use in the algorithm.

I need advice on what percentage of the original data I should use in the random sample. I can't find any professional source that recommends a particular percentage. Even if I take only 10,000 records, the output of the algorithm will exceed 1 GB in file size (not to mention the processing time). Would a 1% sample (around 7,000 records) be reasonable? That is larger than many random surveys (e.g., presidential approval polls), and I would of course do some testing to confirm that the sample is truly representative.

If you could let me know what you think so I can proceed, I would greatly appreciate it!

Daryn Hardwick
Graduate Assistant, St. Cloud State University
Department of Geography and Planning
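
P.S. In R, the sampling and model-fitting steps might look roughly like the sketch below. It assumes the grid has already been exported to a data frame (here called grid_df) with one row per cell, the predictor columns, and a factor column presence marking the species' occurrence; all of these names are placeholders, not actual thesis code.

    library(randomForest)

    set.seed(42)                                # make the draw reproducible
    n   <- nrow(grid_df)                        # ~685,152 cells
    idx <- sample(n, size = round(0.01 * n))    # 1% simple random sample
    train <- grid_df[idx, ]

    # Fit the forest on the sample; importance = TRUE records which
    # predictors contribute most to classifying presence/absence.
    rf <- randomForest(presence ~ ., data = train,
                       ntree = 500, importance = TRUE)
    varImpPlot(rf)                              # rank the variables

Note that randomForest() also has a sampsize argument, which grows each tree on a subsample and can cut memory use without subsetting the data up front.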
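For the representativeness check, one rough option is a two-sample Kolmogorov-Smirnov test per predictor, comparing the sample's distribution against the full grid (again, column names like elevation are placeholders; the sample is drawn from the grid rather than independent of it, and ks.test() warns on ties, so treat the results as a rough guide only):

    # Compare each predictor's distribution in the sample vs. the full grid.
    for (v in c("elevation", "mean_temp", "precip")) {
      ks <- ks.test(train[[v]], grid_df[[v]])
      cat(sprintf("%s: D = %.4f, p = %.3f\n", v, ks$statistic, ks$p.value))
    }

Small D statistics suggest the 1% sample tracks the population distributions.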
