Hadoop has enormous startup costs that are largely inherent in the
current design.

Most notably, mappers and reducers are each executed in a standalone JVM
(ostensibly for safety reasons).
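
For what it's worth, the overhead is easy to see: even a do-nothing,
map-only job over a tiny input pays the full job-submission and
per-task-JVM cost. A rough sketch using the old org.apache.hadoop.mapred
API (the class name and the input/output paths below are made up for
illustration):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

// Times a trivial, map-only job so that the fixed per-job cost
// (job setup, task scheduling, one JVM per task) shows up as
// wall-clock time even on a tiny input.
public class JobOverheadTimer {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(JobOverheadTimer.class);
    conf.setJobName("overhead-probe");
    conf.setMapperClass(IdentityMapper.class); // pass input straight through
    conf.setNumReduceTasks(0);                 // map-only: no shuffle, no reduce
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    long start = System.currentTimeMillis();
    JobClient.runJob(conf);                    // blocks until the job finishes
    System.out.println("wall-clock ms: "
        + (System.currentTimeMillis() - start));
  }
}

Run against a file of a few lines, nearly all of the reported time is
scheduling and JVM startup rather than map work.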



On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:

> Is it possible to execute a job more than once?
> 
> I use map reduce when adding a new instance to a hierarchical cluster
> tree. It finds the least distant node and inserts the new instance as a
> sibling to that node.
> 
> As far as I know it is in the very nature of this algorithm that one
> inserts one instance at a time; this is how the second dimension is
> created that makes it better than a vector cluster. It would be possible
> to map all permutations of instances and skip the reduction, but that
> would result in many more calculations than iteratively training the
> tree, as the latter only requires testing against the instances already
> inserted into the tree.
> 
> Iteratively training this tree using Hadoop means executing one job per
> instance that measures the distance to all instances in a file, to which
> I also append the new instance once it has been inserted in the tree.
> 
> All of the above is very inefficient, especially with a young tree that
> could be trained in nanoseconds locally. So I do that until it takes 20
> seconds to insert an instance.
> 
> But really, this is all Hadoop framework overhead. I'm not quite sure
> what all it does when I execute a job, but it seems like quite a lot.
> And all I'm doing is executing a couple of identical jobs over and over
> again using new data.
> 
> It would be very nice if it just took a few milliseconds to do that.
> 
> 
>        karl
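
For reference, here is a rough sketch of the kind of per-instance
distance job Karl describes: one map pass over the file of
already-inserted instances, emitting each instance together with its
distance to the new one, so the least distant node can be picked
afterwards. The config key, parse() and the Euclidean distance() are
stand-ins rather than anything from Karl's actual code.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// For each already-inserted instance (one per input line), emit the
// line together with its distance to the new instance. The new
// instance is passed in through the JobConf.
public class DistanceMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, DoubleWritable> {

  private double[] newInstance;

  public void configure(JobConf conf) {
    // "new.instance" is a made-up config key for this sketch.
    newInstance = parse(conf.get("new.instance"));
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, DoubleWritable> out,
                  Reporter reporter) throws IOException {
    double[] existing = parse(line.toString());
    out.collect(line, new DoubleWritable(distance(newInstance, existing)));
  }

  // Placeholders: comma-separated doubles, Euclidean distance.
  private static double[] parse(String s) {
    String[] parts = s.split(",");
    double[] v = new double[parts.length];
    for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i]);
    return v;
  }

  private static double distance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
    return Math.sqrt(sum);
  }
}

The map work here is trivial; as noted above, the per-job setup and
per-task JVM startup dominate when it is run once per inserted instance.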
