Indeed.  And those datasets exist.

It is also plausible that this full data scan approach will fall short when
the goal is to make forest building take less time.

It is also plausible that a full data scan approach fails to improve enough
on a non-parallel implementation.  This would happen if a significant
fraction of the entire forest could be built on a single node, which in turn
would happen if the CPU cost of forest building is overshadowed by the I/O
cost of scanning the data set.  In that case there would be a small limit to
the amount of parallelism that would help.
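To put rough numbers on that (these are made-up figures for illustration,
not measurements from Mahout), here is a small Amdahl-style sketch: if every
node still has to scan the full data set, that fixed I/O time caps the
achievable speedup no matter how many nodes split the CPU work.

public class SpeedupSketch {
  public static void main(String[] args) {
    // Illustrative numbers only; real values would come from profiling.
    double ioSeconds = 600;   // time to scan the full data set once (not parallelized)
    double cpuSeconds = 200;  // CPU time to build all the trees on one node

    double serialTime = ioSeconds + cpuSeconds;
    for (int nodes : new int[] {1, 2, 4, 8, 16}) {
      // Each node still scans the whole data set, but the tree-building
      // CPU work is divided across the nodes.
      double parallelTime = ioSeconds + cpuSeconds / nodes;
      System.out.printf("%2d nodes: speedup = %.2f%n",
          nodes, serialTime / parallelTime);
    }
    // With these numbers the speedup can never exceed
    // (ioSeconds + cpuSeconds) / ioSeconds = 1.33, however many nodes are used.
  }
}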

You will know much more about this after you finish the non-parallel
implementation than either of us knows now.

On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim <a_dene...@yahoo.fr> wrote:

> There is still one case that this approach, even out-of-core, cannot
> handle: very large datasets that cannot fit on the node's hard drive, and thus
> must be distributed across the cluster.
>



-- 
Ted Dunning, CTO
DeepDyve
