Thanks for this, it's now about as fast as one could reasonably expect.
Simon's solution is astonishingly fast; however, I had to reconstruct the 
factors and their levels, which were (expectedly) lost during the c() operation. 
Unfortunately this eats up a fair amount of CPU, but on a 14-column, ~2 
million-row data frame it is still 2x faster than the elegant one-line 
solution.
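
For the record, here is why the reconstruction is needed: with the R version 
I'm using, c() applied to factors returns the bare integer codes, so both the 
levels and the class are lost. A minimal illustration:

f <- factor(c("A", "B", "A"))
c(f, f)   # integer codes 1 2 1 1 2 1; the levels "A" and "B" are gone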

Some performance figures:

> t <- proc.time()
> dl <- mclapply(lsessions, mcfun, mc.cores=cores)
> print(proc.time()-t)
       user      system     elapsed 
    171.894      47.696      28.713

> l <- dl
> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
> names(all) = names(l[[1]])
> #attr(all, "row.names") = seq.int(all[[1]])
> attr(all, "row.names") = c(NA, -length(all[[1]]))  # compact internal form for 1:n row names
> class(all) = "data.frame"
       user      system     elapsed 
      0.412       0.280       0.708 

> all$factor <- factor(all$factor); levels(all$factor) <- c("A","B")
...
       user      system     elapsed 
      4.852       2.349       7.038
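
An untested sketch of a possibly cheaper reconstruction: when every chunk is 
built with identical level sets, the integer codes coming out of c() already 
line up, so the factor columns could be rebuilt by restoring their attributes 
directly, instead of re-scanning ~2 million values with factor(). The loop 
below takes the levels from the first chunk, l[[1]]; I haven't timed it:

# sketch, assuming every chunk's factor columns share identical levels,
# so the concatenated integer codes are already consistent
for (col in names(l[[1]])) {
  if (is.factor(l[[1]][[col]])) {
    attr(all[[col]], "levels") <- levels(l[[1]][[col]])  # restore level labels
    class(all[[col]]) <- "factor"                        # restore the class
  }
}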

> my_df = do.call(rbind, dl)
       user      system     elapsed 
      9.791       5.411      15.039 

Thanks to both of you!

Vincent


On Jun 29, 2011, at 9:48 PM, Simon Urbanek wrote:

> 
> On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:
> 
>> Is the slowdown happening while mclapply runs or while you're doing
>> the rbind? If the latter, I wonder if the code below is more efficient
>> than using rbind inside a loop:
>> 
>> my_df = do.call( rbind , my_list_from_mclapply )
>> 
> 
> Another potential issue is that data frames perform many sanity checks, due 
> to row.names handling etc. If you don't use row.names *and* know in advance 
> that the concatenation is benign *and* your data types are compatible, you 
> can usually speed things up immensely by operating on lists instead and 
> converting to a data frame at the very end, by declaring the resulting list 
> to be of class data.frame. Again, this only works if you really know what 
> you're doing, but the speed-up can be very big (usually orders of 
> magnitude). This is general advice, not specific to rbind. Whether it would 
> work for you or not is easy to test - something like
> 
> l = my_list_from_mclapply
> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x) x[[i]])))
> names(all) = names(l[[1]])
> attr(all, "row.names") = c(NA, -length(all[[1]]))
> class(all) = "data.frame"
> 
> Again, make sure all the assumptions above are satisfied before using.
> 
> Cheers,
> Simon
> 
> 
> 
>> 
>> 
>> On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <[email protected]> wrote:
>>> Hi all,
>>> 
>>> I'm using mclapply() from the multicore package to process chunks of data 
>>> in parallel, and it works great.
>>> 
>>> But when I want to collect all processed elements of the returned list into 
>>> one big data frame it takes ages.
>>> 
>>> The elements are all data frames with identical column names, and I'm 
>>> using a simple rbind() inside a loop to collect them. But I guess it 
>>> performs some expensive checks at each iteration, because it gets slower 
>>> and slower as it goes. Writing the individual files out to disk, 
>>> concatenating them with the system and reading the resulting file back in 
>>> is actually faster...
>>> 
>>> Is there a magic argument to rbind() that I'm missing, or is there another 
>>> way to collect the results of parallel processing efficiently?
>>> 
>>> Thanks,
>>> Vincent
>>> 
>> 
> 

_______________________________________________
R-SIG-Mac mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
