If you are looking at large numbers of independent images, then Hadoop
should be close to perfect for this analysis (the problem is
embarrassingly parallel).  If you are looking at video, you can still do
quite well by building what is essentially a probabilistic list of
recognized items in the video stream in the map phase, giving all frames
from a single shot the same reduce key.  Then, in the reduce phase, you
can correlate the candidate objects and their probabilities according to
object persistence models.  It would be good to do another pass after
that for scene-to-scene correlations.  This formulation gives you
near-perfect parallelism as well; a rough sketch is below.
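
To make the shot-keyed formulation concrete, here is a sketch written as
Hadoop Streaming scripts in Python.  The input format, the detect() stub,
and the persistence weighting are hypothetical stand-ins, not a real
recognizer:

    # frame_map.py -- one frame record per input line:
    #   shot_id <tab> frame_id <tab> path/to/frame
    # Emits the shot id as the reduce key so that all frames from one
    # shot land on the same reducer.
    import sys

    def detect(frame_path):
        # Stand-in for a real recognizer returning (label, prob) pairs.
        return [("car", 0.80), ("person", 0.35)]

    for line in sys.stdin:
        shot_id, frame_id, path = line.rstrip("\n").split("\t")
        hits = ",".join("%s:%.3f" % (lbl, p) for lbl, p in detect(path))
        print("%s\t%s" % (shot_id, hits))

    # shot_reduce.py -- correlate per-frame detections within one shot.
    # A crude persistence model: an object is weighted up if it is
    # reported in many frames of the same shot.
    import sys
    from collections import defaultdict

    def flush(shot_id, probs_by_label, frames):
        if shot_id is None:
            return
        for label, probs in sorted(probs_by_label.items()):
            miss = 1.0
            for p in probs:
                miss *= (1.0 - p)
            persistence = len(probs) / float(frames)
            print("%s\t%s\t%.3f" %
                  (shot_id, label, (1.0 - miss) * persistence))

    current, probs_by_label, frames = None, defaultdict(list), 0
    for line in sys.stdin:
        shot_id, hits = line.rstrip("\n").split("\t")
        if shot_id != current:
            flush(current, probs_by_label, frames)
            current, probs_by_label, frames = shot_id, defaultdict(list), 0
        frames += 1
        for item in hits.split(","):
            label, p = item.rsplit(":", 1)
            probs_by_label[label].append(float(p))
    flush(current, probs_by_label, frames)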

For NLP, the problem at the level of phrasal analysis can also be made
trivially parallel if you have large numbers of documents.  Again, you
may need a secondary pass to find duplicated references across multiple
documents, but that pass is usually far less intensive than the original
analysis; a sketch of such a pass is below.
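
Here is a sketch of that second pass, again as hypothetical Hadoop
Streaming scripts in Python: the map step keys each extracted reference
by a crudely normalized form, and the reduce step just collects the
documents that share it.

    # ref_map.py -- input: doc_id <tab> extracted phrase (one per line)
    import sys
    for line in sys.stdin:
        doc_id, phrase = line.rstrip("\n").split("\t", 1)
        key = " ".join(phrase.lower().split())   # crude normalization
        print("%s\t%s" % (key, doc_id))

    # ref_reduce.py -- group the documents mentioning each normalized form
    import sys
    current, docs = None, set()
    for line in sys.stdin:
        key, doc_id = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%s" % (current, ",".join(sorted(docs))))
            current, docs = key, set()
        docs.add(doc_id)
    if current is not None:
        print("%s\t%s" % (current, ",".join(sorted(docs))))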

Standard scientific HPC architectures are all about facilitating
arbitrary communication patterns and process boundaries.  That is
exceedingly hard to do well, and few systems attain really good
performance.  Hadoop, in contrast, is built around one primitive that is
simple enough to be implemented very well on simple, cheap hardware.
What is (a bit) surprising is how many problems can be expressed well as
map-reduce programs.  Sometimes this is only true at very large scale,
where correlations become small enough that the map phase can do useful
work on many sub-units; sometimes it requires relatively large
intermediate data (as with many graph algorithms).  The fact remains,
however, that it works remarkably well.
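
(For anyone who has not seen that primitive spelled out, the canonical
word-count example, written here as hypothetical Hadoop Streaming
scripts in Python, shows how little machinery it takes.)

    # wc_map.py -- emit one (word, 1) pair per token
    import sys
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

    # wc_reduce.py -- input arrives sorted by key, so counts can be
    # summed in a single pass with constant memory
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(n)
    if current is not None:
        print("%s\t%d" % (current, total))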

On 12/4/07 7:12 PM, "Bob Futrelle" <[EMAIL PROTECTED]> wrote:

> For us, we want to do pattern recognition, turning
> raster images into collections of the objects we discover in the
> images. Another focus for us is NLP, esp. phrasal analysis.
