[I originally sent this message by accident only to the original poster. That appears to be leading to an interesting discussion, but there's a question of my own embedded somewhere down there about additional resources, so I figured I'd better send it to everyone else too :-) ]
This is a problem that has come up in the past in work I do, related to estimation of travel demand from input data that is supposed to be "the same thing" but has been collected over different geographic and temporal scales, using different methods.

One reference that I've found useful for dealing with such problems (where variations in scale, resolution, accuracy, and completeness in data sets need to be reconciled) is Banerjee, Carlin and Gelfand's book "Hierarchical Modeling and Analysis for Spatial Data". I recall that they have a pretty good discussion, for example, about how to deal with the data loss that happens at the edges of such an area, where the edges of the datasets are not aligned (see the chapter on spatial misalignment), and some of the other problems mentioned in the post are also explicitly addressed.

I'll say some more below about the path I have used in solving a similar problem, but anyone reading this posting needs to understand that I'm a "user", not a "producer" of these techniques. Since this kind of problem occurs periodically in my work, I'm also interested in hearing about additional references and strategies from those who study such problems regularly and think them all the way through, as opposed to people like me who consume these techniques in a fearless rush on an as-needed basis because we have to process specific data and get it out the door in a hurry, without distorting things too badly.

The basic solution strategy I've pursued is to recognize that the individual datasets each contain a grain of "truth" (i.e. reliable evidence about what one would, under perfect circumstances, expect to find at a particular location, or over a suitable unit area), but that the truth is veiled behind various random or systematic distortions within the dataset. One needs to sift out the distortions and preserve the "good" part.
In that light, it's pretty easy to see that reconciling the data is basically no different from any modeling problem where you have a series of measurements that serve as predictors along with some partial measurements of a response variable you're really interested in. It boils down to finding a best fit for a model that describes the relationship between the measurements (partial, fuzzy, inconsistent, incomplete data) and the quantity of interest (crisp, regular, complete data). The fact that the model outcome purports in this case to be the same quantity that the input variables are supposedly measuring is, practically speaking, irrelevant.

The modeling challenge is to analyze how the individual data sets relate to each other in delivering the complete picture, use that analysis to estimate a best fit for the model, test against some data that weren't part of the original estimation, do a gut-level (or "smell") test on the overall results, try a revised or different model, and iterate until everyone who matters is convinced that the results are good enough. I can feel the serious researchers on this list squirming in pain with that last comment, but that's the difference between working for a department of transportation and working for a research institution. It's often amazing how "truth" and "accuracy" can have such different meanings in these two contexts...

Obviously, it's more work to do such modeling than to pick and choose among the data sets, or to average them in some simple way, and it also requires that one really understand the structure, strengths and limitations of the data sets (which is usually a good thing to do and which, if my own experience is any guide, people are inclined not to do thoroughly enough without some external prodding, such as having to build a working model of the data relationships...).
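As a toy illustration of the "data sources as predictors" idea (not the model actually used in the work described here), the sketch below simulates three imperfect sources measuring the same quantity, fits a least-squares combination against a "verified" subset, and checks it on held-out locations against a simple average. Every number in it (the biases, noise levels, sample sizes) and every name (`fit_weights`, `solve`, `rmse`) is an assumption made up for the example.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_weights(rows, y):
    """Ordinary least squares via the normal equations (X'X) w = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def rmse(pred, y):
    return (sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)) ** 0.5

random.seed(1)
# Made-up "true" values at 200 locations, e.g. hand-verified employment.
truth = [random.uniform(100, 2000) for _ in range(200)]
# Three made-up sources: one undercounts, one overcounts, one is just noisy.
# The leading 1.0 is an intercept term.
rows = [[1.0,
         0.8 * t + random.gauss(0, 50),
         1.3 * t + random.gauss(0, 200),
         t + random.gauss(0, 400)] for t in truth]

# Estimate on 150 locations, test on the 50 that were held out.
train_x, test_x = rows[:150], rows[150:]
train_y, test_y = truth[:150], truth[150:]

w = fit_weights(train_x, train_y)
fitted = [sum(wi * xi for wi, xi in zip(w, r)) for r in test_x]
naive = [sum(r[1:]) / 3 for r in test_x]  # simple average of the three sources

rmse_fitted = rmse(fitted, test_y)
rmse_naive = rmse(naive, test_y)
print("holdout RMSE, fitted combination:", round(rmse_fitted, 1))
print("holdout RMSE, simple average:    ", round(rmse_naive, 1))
```

In this artificial setup the fitted combination should beat the simple average on the held-out locations, because it can down-weight the noisy sources and correct the systematic over- and undercounts. Real data will rarely be this cooperative, which is where the "smell" test comes in.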
One ends up constructing multi-level models that take each data set as an imperfect representation at a certain "level" (scale, resolution, extent), and estimating a "best fit" of that model to produce outcome data at the desired output scale - the fitted model is then used to produce a cleaned up data set that has leveled all the available data to a single consistent scale, resolution, and extent. A simple source for the response variable might be places where all the data sets are in reasonable agreement - but be careful that the agreement is not just due to those being the most uninteresting locations!

Explaining actual code in this case probably wouldn't be useful (plus, it feels better to sound like I might know something, rather than to offer direct evidence that I don't). But here's a (very simplified) overview of how I set up one such problem in my own domain:

A regular indicator used in my profession (travel demand modeling) to evaluate the amount of vehicular traffic that is drawn to certain locations is the number of employees working at the location. The process (technically speaking, the attraction end of trip generation) is one that has a "natural" scale, in that estimates of travel get worse if you try to be too precise (perhaps because unaccounted background processes create variance that, at too small a scale, dominates the local estimate of expected value - I'm sure there's a better technical explanation of this phenomenon), but if you operate at too large a scale, the resulting picture, while more "accurate" (i.e. less variable), is too general and uninformative. Think of a picture composed of pixels and what happens when you zoom in too far, or zoom out too far.
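The zoom-in/zoom-out trade-off can be shown with a toy simulation. It assumes, purely for illustration, that every small parcel has the same expected number of trips plus independent noise (a Gaussian, so individual values can even go negative, which real counts cannot). The parcel count, the mean and noise values, and the function name `relative_spread` are all made up for the example.

```python
import random
import statistics

random.seed(42)
N_PARCELS = 4096
# Hypothetical parcels: expected 10 trips each, with independent noise (sd 5).
trips = [random.gauss(10, 5) for _ in range(N_PARCELS)]

def relative_spread(values, zone_size):
    """Group consecutive parcels into zones of zone_size and return
    (number of zones, coefficient of variation of the zone totals)."""
    totals = [sum(values[i:i + zone_size])
              for i in range(0, len(values), zone_size)]
    return len(totals), statistics.pstdev(totals) / statistics.mean(totals)

results = {size: relative_spread(trips, size) for size in (1, 16, 256)}
for size, (n_zones, cv) in results.items():
    print(f"zone size {size:4d}: {n_zones:5d} zones, "
          f"coefficient of variation {cv:.3f}")
```

Larger zones give totals that vary less relative to their mean, but there are far fewer of them to describe the area, so spatial detail is lost. Where the "natural" scale falls between those extremes is not something this sketch can answer.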
We try to account for that natural scale by aggregating predictors such as number of employees into geographical areas called traffic analysis zones whose size reflects the "best" (natural) scale (not too much variation, but also not too general) - that scale in our case is usually chosen by rule of thumb rather than explicit statistical analysis (so I don't have much guidance about how to pick the best scale scientifically). Of course, traffic analysis zones (particularly in dense urban areas) usually do not correspond directly to other common units of geography in which data of interest are reported.

The sources of employment data themselves are many and inconsistent (and I'm entirely leaving out the fact that we usually sub-divide employment into different types such as retail, office, industrial, etc., recognizing that retail establishments attract more trips (and different kinds of trips) per employee than offices do). So we might use privately collected data from marketing firms (where number of employees is a proxy for the power and presence of the company in a certain market), or from government labor statistics such as records kept for unemployment insurance, or census records that suggest (via journey-to-work statistics) how many people might work in a certain area. Some of these data are presented as point counts at particular workplaces; some are aggregated to geographic regions such as states or counties or economic analysis zones or census tracts. Some of it is simply always going to be missing (try asking the US military how many soldiers they expect to have stationed at a certain military base, or finding out how many employees a certain Mexican restaurant has when most of the people who work there are not, legally speaking, working there).
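One common and very crude way to move counts from reporting units that do not line up with traffic analysis zones is areal weighting: split each unit's total among the zones it overlaps, in proportion to the overlapping area. The sketch below assumes the overlap shares have already been computed (for example in a GIS) and that jobs are spread evenly within each tract, which is a strong assumption. The tract and zone names, the counts, and the function `areal_weight` are hypothetical.

```python
# Hypothetical tract-level employment counts.
tract_jobs = {"tract_A": 1200.0, "tract_B": 300.0}

# Hypothetical share of each tract's area that falls in each traffic
# analysis zone. The shares for a tract sum to 1, so totals are preserved.
overlap = {
    "tract_A": {"zone_1": 0.75, "zone_2": 0.25},
    "tract_B": {"zone_2": 1.0},
}

def areal_weight(source_totals, shares):
    """Allocate each source unit's total to zones in proportion to overlap."""
    zone_totals = {}
    for unit, total in source_totals.items():
        for zone, share in shares[unit].items():
            zone_totals[zone] = zone_totals.get(zone, 0.0) + total * share
    return zone_totals

zone_jobs = areal_weight(tract_jobs, overlap)
print(zone_jobs)  # {'zone_1': 900.0, 'zone_2': 600.0}
```

Point counts at individual workplaces are simpler in principle, since each point falls in exactly one zone, although geocoding errors bring problems of their own.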
And some of it is systematically distorted: unemployment insurance statistics don't count people who are not eligible for unemployment insurance; some industries such as temporary employment agencies report employment statistics at the headquarters, even though the workers are actually spread all over the map; some reporting sources show suspicious spikes in frequency of companies with 50, 100, 500 or 1000 employees; some of the datasets will double-count certain types of employees (e.g. cafeteria workers in a corporate office building who are actually employed by - and show up in the statistics for - a food services sub-contractor).

Based on studying each of the data sources, and using some survey data that established a "true" answer (in this case, a level of employment verified by hand that would generate answers with our given trip rates that matched traffic counts at a cordon around a small set of study locations over the period of interest), I set up a model that treated each data source as a particular guess (with some adjustment factors such as an estimated correction for contract workers based on surveys of businesses in certain industries) and then estimated the model to merge weighted components of the various data sets into the geographic unit of traffic analysis zones.

I won't pretend to have done this in practice with anything near the thoroughness that the problem merits, but the results seemed reasonable at the time and they could in principle be adjusted consistently as data sources are updated.

Good luck with your problem!
--
Jeremy Raw, P.E., AICP
Senior Modeling Systems Analyst | Transportation and Mobility Planning | Virginia DOT
[EMAIL PROTECTED] | Desk: 804-786-0998 | Fax: 804-225-4785

________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Monica Pisica
Sent: Wednesday, December 03, 2008 4:50 PM
To: [email protected]
Subject: [R-sig-Geo] merging multiscale data

Hi everybody,

I am very much interested in hearing about your experience with merging data at different scales to obtain a uniform surface at the finest scale possible, or if not, at the most optimum scale. Now I know that "optimum" is in the eye of the beholder, but maybe we can agree on a definition of optimum scale.

Anyway, suppose I have an area (any area) for which I have several sets of data, raster and vector, at different scales and different spatial extents. Some overlap, some do not, but the whole area is more or less covered by data. Also suppose that all datasets are in the same datum / projection, so we are not concerned about that. How do I go about merging all of that?

One idea might be to transform everything into xyz coordinates, interpolate the area, get different surfaces for different resolutions, compare the RMSE, and choose the one with the smallest RMSE. Is that an option? What if I know the accuracy of each dataset ... how will error propagate then? What if the area is so big that the computer I work on will not be able to do the interpolation, so I will be forced to work on tiles: how do I eliminate the border differences?

If you have any opinion or any good reference to work with, I would really appreciate it.

Thanks,
Monica

_______________________________________________
R-sig-Geo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo
