On Jun 30, 2011, at 7:28 AM, Vincent Aubanel wrote:
> Thanks for this, it's now dead fast, as one could conceivably expect.
> Simon's solution is astonishingly fast, however I had to reconstruct the
> factors and their levels which were (expectedly) lost during the c()
> operation.
One way to avoid that is to call as.character() on the factors inside the
parallel function, so the pieces contain no factors. You can then create the
factor once at the end, which should be faster: factor() calls as.character()
anyway, so that conversion will be a no-op by that point.
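
A minimal sketch of that idea, on made-up data. The worker name mcfun_chr,
the column names and the levels "A"/"B" are assumptions for illustration
only, and plain lapply() stands in for mclapply() so the snippet runs
anywhere:

```r
# Hypothetical worker: returns a piece whose factor column is already character.
mcfun_chr <- function(i) {
  piece <- data.frame(id = c(i, i),
                      grp = factor(c("A", "B"), levels = c("A", "B")))
  piece$grp <- as.character(piece$grp)  # drop the factor before returning
  piece
}

# lapply() stands in for mclapply(); the collection step is the same.
pieces <- lapply(1:3, mcfun_chr)

# Concatenate column by column, then declare the list a data frame.
combined <- lapply(seq_along(pieces[[1]]), function(j)
  do.call(c, lapply(pieces, function(p) p[[j]])))
names(combined) <- names(pieces[[1]])
attr(combined, "row.names") <- c(NA_integer_, -length(combined[[1]]))
class(combined) <- "data.frame"

# Build the factor once, on the full character column.
combined$grp <- factor(combined$grp, levels = c("A", "B"))
```

Passing levels= explicitly keeps the level order fixed even if some level
happens to be missing from the data.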
Cheers,
S
> Unfortunately this eats up a fair amount of CPU, but on a 14-column, ~2
> million-row data frame it is still 2x faster than the elegant one-line
> solution.
>
> Some figures of performance:
>
>> t <- proc.time()
>> dl <- mclapply(lsessions, mcfun, mc.cores=cores)
>> print(proc.time()-t)
> user system elapsed
> 171.894 47.696 28.713
>
>> l <- dl
>> all = lapply(seq.int(l[[1]]), function(i) do.call(c, lapply(l, function(x)
>> x[[i]])))
>> names(all) = names(l[[1]])
>> #attr(all, "row.names") = seq.int(all[[1]])
>> attr(all, "row.names") = c(NA, -length(all[[1]]))
>> class(all) = "data.frame"
> user system elapsed
> 0.412 0.280 0.708
>
>> all$factor <- factor(all$factor); levels(all$factor) <- c("A","B")
> ...
> user system elapsed
> 4.852 2.349 7.038
>
>> my_df = do.call(rbind, dl)
> user system elapsed
> 9.791 5.411 15.039
>
> Thanks to both of you!
>
> Vincent
>
>
> On 29 June 2011 at 21:48, Simon Urbanek wrote:
>
>>
>> On Jun 29, 2011, at 2:59 PM, Mike Lawrence wrote:
>>
>>> Is the slowdown happening while mclapply runs or while you're doing
>>> the rbind? If the latter, I wonder if the code below is more efficient
>>> than using rbind inside a loop:
>>>
>>> my_df = do.call( rbind , my_list_from_mclapply )
>>>
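
To make the comparison above concrete, here is a small sketch on made-up
data (my_list is a hypothetical stand-in for the list returned by
mclapply()). It only shows the two ways of collecting the pieces; it makes
no claim about timings:

```r
# Made-up stand-in for the list returned by mclapply().
my_list <- lapply(1:50, function(i)
  data.frame(chunk = i, value = c(0.5, 1.5) * i))

# rbind() in a loop: every iteration copies all rows accumulated so far.
looped <- my_list[[1]]
for (piece in my_list[-1]) looped <- rbind(looped, piece)

# One call over the whole list: rbind() sees all pieces at once.
bound <- do.call(rbind, my_list)
```

Wrapping either route in system.time() on the real data would show where
the time actually goes.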
>>
>> Another potential issue is that data frames do many sanity checks that are
>> due to row.names handling etc. If you don't use row.names *and* know in
>> advance that the concatenation is benign *and* your data types are
>> compatible, you can usually speed things up immensely by operating on lists
>> instead and converting to a data frame at the very end by declaring that the
>> resulting list conforms to the data.frame class. Again, this only works if
>> you really know what you're doing, but the speed-up can be very big (usually
>> orders of magnitude). This is general advice, not specific to rbind.
>> Whether it would work for you or not is easy to test - something like
>>
>> l = my_list_from_mclapply
>> # concatenate column i across all pieces, for every column
>> all = lapply(seq_along(l[[1]]), function(i)
>>              do.call(c, lapply(l, function(x) x[[i]])))
>> names(all) = names(l[[1]])
>> # compact row names: c(NA, -n) stands for 1:n
>> attr(all, "row.names") = c(NA, -length(all[[1]]))
>> class(all) = "data.frame"
>>
>> Again, make sure all the assumptions above are satisfied before using.
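
One possible way to make those assumptions explicit is to check them before
taking the shortcut. The helper name bind_pieces and the test data below
are hypothetical; this is a sketch of the list-based approach described
above, not a drop-in replacement for rbind():

```r
# Hypothetical helper: concatenate a list of data frames column by column,
# after checking the assumptions that make the shortcut safe.
bind_pieces <- function(l) {
  l <- unname(l)  # names on the list would leak into the columns via c()
  nms <- names(l[[1]])
  cls <- lapply(l[[1]], class)
  for (p in l) {
    stopifnot(identical(names(p), nms),          # same columns, same order
              identical(lapply(p, class), cls),  # same column types
              # no factors: this thread reports levels lost by c()
              !any(vapply(p, is.factor, logical(1))))
  }
  out <- lapply(seq_along(nms), function(i)
    do.call(c, lapply(l, function(x) x[[i]])))
  names(out) <- nms
  attr(out, "row.names") <- c(NA_integer_, -length(out[[1]]))  # compact 1:n
  class(out) <- "data.frame"
  out
}

# Made-up pieces, as a parallel worker might return them.
pieces <- lapply(1:4, function(i)
  data.frame(chunk = rep(i, 3), value = i + c(0.1, 0.2, 0.3)))

fast <- bind_pieces(pieces)      # list-based concatenation
slow <- do.call(rbind, pieces)   # reference result
```

Comparing fast against slow on a small subset of the real data is a cheap
way to confirm the shortcut is benign before using it on everything.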
>>
>> Cheers,
>> Simon
>>
>>
>>
>>>
>>>
>>> On Wed, Jun 29, 2011 at 3:34 PM, Vincent Aubanel <[email protected]>
>>> wrote:
>>>> Hi all,
>>>>
>>>> I'm using mclapply() of the multicore package for processing chunks of
>>>> data in parallel, and it works great.
>>>>
>>>> But when I want to collect all processed elements of the returned list
>>>> into one big data frame it takes ages.
>>>>
>>>> The elements are all data frames with identical column names, and I'm
>>>> using a simple rbind() inside a loop to do that. But I guess it performs
>>>> some expensive checks at each iteration, as it gets slower and slower as
>>>> it goes. Writing individual files out to disk, concatenating them with a
>>>> system command and reading the resulting file back in is actually
>>>> faster...
>>>>
>>>> Is there a magic argument to rbind() that I'm missing, or is there any
>>>> other solution to collect the results of parallel processing efficiently?
>>>>
>>>> Thanks,
>>>> Vincent
>>>>
>>>> _______________________________________________
>>>> R-SIG-Mac mailing list
>>>> [email protected]
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>>
>>>
>>>
>>>
>>
>
>