Hadoop has enormous startup costs that are largely inherent in the current design. Most notably, each map and reduce task is executed in a standalone JVM (ostensibly for safety reasons).
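
Depending on the release you are running, part of that per-task JVM cost can sometimes be amortized by letting the framework reuse task JVMs within a job. A minimal sketch, assuming a Hadoop version that honours the mapred.job.reuse.jvm.num.tasks property (it is not available in every release), with an illustrative class name:

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Ask the framework to keep one task JVM alive for an unlimited
        // number of tasks of this job (-1) instead of forking a fresh JVM
        // per task. Only takes effect on releases that support the property.
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        // ... set mapper, reducer, input/output paths, then submit as usual ...
      }
    }

Even with JVM reuse, the per-job submission and scheduling overhead remains, so very small jobs still pay a fixed cost.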
On 4/17/08 6:00 PM, "Karl Wettin" <[EMAIL PROTECTED]> wrote:

> Is it possible to execute a job more than once?
>
> I use MapReduce when adding a new instance to a hierarchical cluster
> tree. It finds the least distant node and inserts the new instance as a
> sibling to that node.
>
> As far as I know it is in the very nature of this algorithm that one
> inserts one instance at a time; this is how the second dimension is
> created that makes it better than a vector cluster. It would be possible
> to map all permutations of instances and skip the reduction, but that
> would result in many more calculations than iteratively training the
> tree, as the latter only requires one to test against the instances
> already inserted into the tree.
>
> Iteratively training this tree using Hadoop means executing one job per
> instance that measures distance to all instances in a file that I also
> append the new instance to once it is inserted in the tree.
>
> All of the above is very inefficient, especially with a young tree that
> could be trained in nanoseconds locally. So I do that until it takes 20
> seconds to insert an instance.
>
> But really, this is all Hadoop framework overhead. I'm not quite sure of
> all it does when I execute a job, but it seems like quite a lot. And all
> I'm doing is executing a couple of identical jobs over and over again
> using new data.
>
> It would be very nice if it just took a few milliseconds to do that.
>
>
> karl
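
For reference, the per-instance job described above might look roughly like the sketch below, written against the old mapred API of that era. The class names, the comma-separated vector encoding, the Euclidean distance, and the cluster.new.instance job property are illustrative assumptions, not Karl's actual code; it just shows the shape of a job that measures distance to every instance already in the file and keeps the least distant one.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class NearestNodeJob {

      // Mapper: each input line is one instance already inserted into the
      // tree (assumed here to be a comma-separated vector). It computes the
      // distance from that instance to the new instance, which is passed in
      // through a job property, and emits "distance <TAB> instance" under a
      // single key.
      public static class DistanceMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {

        private double[] newInstance;

        public void configure(JobConf job) {
          newInstance = parse(job.get("cluster.new.instance"));
        }

        public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          double[] existing = parse(value.toString());
          double d = euclidean(existing, newInstance);
          output.collect(new Text("nearest"),
              new Text(d + "\t" + value.toString()));
        }
      }

      // Reducer: all distances arrive under the single key, so one reduce
      // call picks the least distant instance; the driver would then insert
      // the new instance as a sibling of that node.
      public static class MinDistanceReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          double min = Double.MAX_VALUE;
          String nearest = null;
          while (values.hasNext()) {
            String[] parts = values.next().toString().split("\t", 2);
            double d = Double.parseDouble(parts[0]);
            if (d < min) {
              min = d;
              nearest = parts[1];
            }
          }
          output.collect(new Text(nearest), new Text(Double.toString(min)));
        }
      }

      static double[] parse(String line) {
        String[] parts = line.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
          v[i] = Double.parseDouble(parts[i].trim());
        }
        return v;
      }

      static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
          double diff = a[i] - b[i];
          sum += diff * diff;
        }
        return Math.sqrt(sum);
      }
    }

A driver would submit this once per new instance (e.g. via JobClient.runJob), append the inserted instance to the input file, and repeat; that is exactly the pattern where the fixed job-submission overhead dominates while the tree is still small.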