On 01/07/2011 12:05 AM, Dieter Menne wrote:
>
>
> Jeroen Ooms wrote:
>>
>> What is the most efficient method of parsing a dataframe-like structure
>> that has been json encoded in record-based format rather than vector
>> based. For example a structure like this:
>>
>> [ {"name":"joe", "gender":"male", "age":41}, {"name":"anna",
>> "gender":"female", "age":23} ]
>>
>> RJSONIO parses this as a list of lists, which I would then have to apply
>> as.data.frame to and append them to an existing dataframe, which is
>> terribly slow.
>>
>>
>
> unlist is pretty fast. The solution below assumes that you know how your
> structure is, so it is not very flexible, but it should show you that the
> conversion to data.frame is not the bottleneck.
>
> # json
> library(RJSONIO)
> # [ {"name":"joe", "gender":"male", "age":41},
> # {"name":"anna", "gender":"female", "age":23} ]
> n = 300000
> d = data.frame(name=rep(c("joe","anna"),n),
>                gender=rep(c("male","female"),n),
>                age = rep(c("23","41"),n))
> dj = toJSON(d)

This doesn't create the required structure:

> cat(dj)
{
 "name": [ "joe", "anna", "joe", "anna" ],
 "gender": [ "male", "female", "male", "female" ],
 "age": [ "23", "41", "23", "41" ]
}

Instead:

  library(rjson)
  n <- 1000
  name <- apply(matrix(sample(letters, n * 5, TRUE), n), 1, paste,
                collapse="")
  gender <- sample(c("male", "female"), n, TRUE)
  age <- ceiling(runif(n, 20, 60))
  recs <- sprintf('{"name": "%s", "gender":"%s", "age":%d}',
                  name, gender, age)
  j <- sprintf("[%s]", paste(recs, collapse=","))
  lol <- fromJSON(j)

and then with

  f <- function(lst)
      function(nm) unlist(lapply(lst, "[[", nm), use.names=FALSE)

> oopt <- options(stringsAsFactors=FALSE) # convenience for 'identical'
> system.time({
+     df0 <- as.data.frame(Map(f(lol), names(lol[[1]])))
+ })
   user  system elapsed
  0.006   0.000   0.006

versus, for instance,

> system.time({
+     df1 <- do.call(rbind, lapply(lol, data.frame))
+ })
   user  system elapsed
  1.497   0.000   1.500
> identical(df0, df1)
[1] TRUE

Martin

>
> system.time(d1 <- fromJSON(dj))
> # user system elapsed
> # 4.06 0.26 4.32
>
> system.time(
> dd <- data.frame(
>     name = unlist(d1$name),
>     gender = unlist(d1$gender),
>     age=as.numeric(unlist(d1$age)))
> )
> # user system elapsed
> # 1.13 0.05 1.18
>
>

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
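
A note on the column-wise approach above: it takes the field names from the first record and relies on every record carrying every field. If a record lacks a field, "[[" returns NULL for it, unlist() drops the NULL, and the columns end up with different lengths. The following is only a minimal sketch of one way to guard against that, assuming the parsed input is a list of named lists as a record-based JSON parser would return. The helper name get_col and the sample records are hypothetical and not part of the thread.

```r
# Records as a record-based JSON parser would return them: a list of
# named lists. The second record has no "age" field (hypothetical data).
recs <- list(
  list(name = "joe",  gender = "male",   age = 41),
  list(name = "anna", gender = "female"),
  list(name = "kim",  gender = "female", age = 23)
)

# Hypothetical helper: pull one field out of every record, substituting
# NA where the field is absent so that all columns keep the same length.
get_col <- function(lst, nm) {
  vals <- lapply(lst, function(rec) {
    v <- rec[[nm]]
    if (is.null(v)) NA else v
  })
  unlist(vals, use.names = FALSE)
}

# Use the union of field names instead of only those of the first record.
fields <- unique(unlist(lapply(recs, names)))
cols <- lapply(fields, function(nm) get_col(recs, nm))
names(cols) <- fields

# Build the data frame once, from complete columns.
df <- as.data.frame(cols, stringsAsFactors = FALSE)
print(df)
```

The data frame is still built once from whole columns, so the cost stays close to that of the unlist() version and well below calling rbind for each record.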