For Census data specifically you might want to look at the acs package. I learned about it from
Ray DiGiacomo, Jr. [email protected]:

I am doing an R-oriented US Census webinar on August 29. See this site for the recording of the webinar.

http://liondatasystems.com/rug.html

I think you will find this webinar interesting as it will debut the new "acs" R package.

Best Regards,

Ray DiGiacomo, Jr.
Master R Trainer
Healthcare Predictive Analytics Specialist
President, Lion Data Systems LLC
[email protected]
949-374-2289
Irvine, California USA

On Wed, Jan 29, 2014 at 2:12 AM, andrewH <[email protected]> wrote:

> On Jan 28, 2014 at 8:56pm, David Winsemius wrote:
>
> On Jan 28, 2014, at 8:43 PM, andrewH wrote:
>
>> Hi Folks!
>> I have been writing a small set of utilities for dealing with files that
>> are hard to open correctly for one reason or another, especially because
>> they are too big for memory, non-rectangular, or contain odd characters or
>> unexpected codings, or all of these things together. Today it suddenly hit
>> me that this has probably been done, done better, and upgraded to package
>> form a dozen times already. There were pointers to a couple functions
>> useful in this regard in the Core Import/Export document. But my effort to
>> come up with search terms that were productive of such packages was
>> unsuccessful.
>
> I don't know of a package to do that. You know the quote from that Russian
> author whose name I am forgetting (in "Anna Karenina" perhaps) about happy
> families being all the same but unhappy families being impossible to
> classify. I think it applies to datasets as well. There are too many
> different dataset pathologies to allow a neat packaging approach.
>
> My approach has been to study the options in read.table very carefully and
> if that is insufficient look at either readLines or scan as options. It is
> very useful to be able to use `count.fields` with different parameter
> settings of "quote" and "comment.char". Wrapping it in table() can deliver a
> very compact, useful result.
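
A minimal sketch of the count.fields/table() idea, for anyone finding this thread later. The sample file and its contents are made up for illustration; only count.fields(), table() and their arguments are base R. Under the defaults an apostrophe counts as a quote character and "#" starts a comment, so the two tallies can disagree, and that disagreement is the diagnostic.

```r
# Hypothetical sample file: three fields per row, except the last row.
# Two names contain apostrophes and one field contains a '#'.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,value",
  "1,alpha,10",
  "2,O'Brien,20",
  "3,item #7,30",
  "4,D'Arcy,40",
  "5,delta"
), tmp)

# Defaults: quote = "\"'" and comment.char = "#"
default_counts <- count.fields(tmp, sep = ",")
print(table(default_counts, useNA = "ifany"))

# Only double quotes delimit strings, and nothing is treated as a comment
strict_counts <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")
print(table(strict_counts, useNA = "ifany"))

# Line numbers whose field count differs from the expected three
bad_lines <- which(strict_counts != 3)
print(bad_lines)
```

If the two tables differ, the quote or comment settings are the likely culprit; if they agree but still show more than one field count, the file itself is ragged at the lines reported by which().
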
>
> And don't forget to search the Archives if you have a regular but
> non-rectangular arrangement.
>
> David Winsemius
> Alameda, CA, USA
>
> Thanks, David!
>
> You have quickly summarized a set of techniques that it took me a long time
> to learn (much of it spent disentangling the truth from various
> misconceptions about the data-reading process). I don't think I have very
> much to add to your list, but as always, the effectiveness depends on
> correct implementation, and I have made a _lot_ of mistakes in trying to
> implement these in the past. Moreover, all these things become much more
> complicated if the file is too big to just read into a data frame. I am
> working with Census records right now, and my primary data file is a 14 gig
> csv that had me tearing my hair out trying to read it and pull out the
> variables I have needed at any given moment.
>
> I finally did get it read and the right subset extracted, but it was a
> pretty empirical process - I would just keep trying things that didn't work
> until I found something that did, often not quite understanding why my
> previous efforts had failed. I know that if I have to do this again six
> months from now I will have no idea how I did it. So I wanted to reduce the
> things that worked to functions and set up a sort of decision tree that I
> could work through to find and correct at least the more common problems.
> But I was hoping -- am still hoping, actually -- to find that someone else
> has already done this so I could get back to my real work. It seems like the
> sort of thing that could easily be buried in the 100+ pages of documentation
> of one of the big utility packages like Hmisc, MASS or car.
>
> I have often wished there was a data manipulation and import/export task
> view, with a purview to cover things like what I am talking about here, the
> contents of Phil Spector's book, and packages like Hadley Wickham's plyr.
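
For the too-big-for-memory case described above, one base-R option is to read the file in chunks and keep only the columns you need. This is a rough sketch, not a tested recipe for the Census file: the function name read_subset_in_chunks, its arguments, and the sample data are all hypothetical, and it assumes a comma-separated file with a simple header row and no line breaks inside quoted fields.

```r
# Hypothetical helper: read a large CSV piece by piece, keeping only `keep`.
read_subset_in_chunks <- function(path, keep, chunk_rows = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Header row, with any surrounding double quotes stripped
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  header <- gsub("\"", "", header, fixed = TRUE)

  # "NULL" in colClasses makes read.csv skip a column entirely;
  # "character" avoids type guessing (and keeps leading zeros in codes)
  classes <- ifelse(header %in% keep, "character", "NULL")

  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file
    pieces[[length(pieces) + 1L]] <- read.csv(
      text = lines, header = FALSE, col.names = header,
      colClasses = classes, quote = "\"", comment.char = ""
    )
  }
  do.call(rbind, pieces)
}

# Small made-up file standing in for the real data
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,name,value", "1,a,10", "2,b,20", "3,c,30", "4,d,40", "5,e,50"), tmp)

res <- read_subset_in_chunks(tmp, keep = c("id", "value"), chunk_rows = 2L)
print(res)
```

Everything comes back as character, so convert the columns you need with as.numeric() or similar afterwards. For a file of that size you would probably also filter rows inside the loop rather than rbind every chunk.
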
>
> Warmest regards, andrewH
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Diagnostic-and-helper-functions-for-defective-hard-to-import-files-tp4684357p4684364.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

