If you are looking at large numbers of independent images, then Hadoop should be close to perfect for this analysis (the problem is embarrassingly parallel). If you are looking at video, you can still do quite well by building what is essentially a probabilistic list of recognized items in the video stream in the map phase, giving all frames from a single shot the same reduce key. Then, in the reduce phase, you can correlate the possible objects and their probabilities according to object persistence models; a rough sketch of that keying scheme is below. It would be good to do another pass after that for scene-to-scene correlations. This formulation gives you near-perfect parallelism as well.
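To make the keying concrete, here is a rough sketch against the org.apache.hadoop.mapred API. It assumes a hypothetical upstream step has already run your recognizer over each frame and written one text line per frame of the form "shotId <TAB> label:prob,label:prob,..."; the "persistence model" in the reduce is just a max-probability merge standing in for whatever model you would actually use.

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ShotCorrelation {

  // Map: re-key each frame's candidate detections by shot id so that all
  // frames of one shot land in the same reduce call.
  public static class FrameMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        out.collect(new Text(parts[0]), new Text(parts[1]));
      }
    }
  }

  // Reduce: a stand-in "persistence model" -- here just the maximum
  // per-label probability seen across the shot's frames; a real model
  // would track appearance and disappearance over time.
  public static class ShotReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text shotId, Iterator<Text> frames,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Map<String, Double> best = new HashMap<String, Double>();
      while (frames.hasNext()) {
        for (String det : frames.next().toString().split(",")) {
          String[] kv = det.split(":");
          if (kv.length != 2) continue;
          double p = Double.parseDouble(kv[1]);
          Double prev = best.get(kv[0]);
          if (prev == null || p > prev) best.put(kv[0], p);
        }
      }
      out.collect(shotId, new Text(best.toString()));
    }
  }
}

The scene-to-scene pass mentioned above would be another job of exactly the same shape, just keyed by scene instead of shot.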
For NLP, the problem at the level of phrasal analysis can also be made trivially parallel if you have large numbers of documents. Again, you may need to do a secondary pass to find duplicated references across multiple documents, but this is usually far less intensive than the original analysis (a keying sketch for that pass follows the quoted message below).

Standard scientific HPC architectures are all about facilitating arbitrary communication patterns across process boundaries. This is exceedingly hard to do well, and few systems attain really good performance. Hadoop is all about working with a primitive so simple that it can be implemented really well on simple, cheap hardware. What is a bit surprising is that so many problems can be expressed well as map-reduce programs. Sometimes this is only true at really large scale, where correlations become small (allowing the map phase to do useful work on many sub-units); sometimes it requires relatively large intermediate data (as with many graph algorithms). The fact is, however, that it works remarkably well.

On 12/4/07 7:12 PM, "Bob Futrelle" <[EMAIL PROTECTED]> wrote:

> For us, we want to do pattern recognition, turning
> raster images into collections of the objects we discover in the
> images. Another focus for us is NLP, esp. phrasal analysis.
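And here is the NLP sketch promised above: per-document analysis in the map, cross-document correlation in the reduce. The input format (one "docId <TAB> text" line per document) and the crude whitespace "phrase" extraction are placeholders for a real phrasal analyzer; only the shape of the job matters.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CrossDocReferences {

  // Map: the heavy per-document analysis runs independently here; emit
  // (normalized phrase, docId) so the cheaper cross-document pass can
  // happen in the reduce.
  public static class PhraseMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      String[] parts = line.toString().split("\t", 2);
      if (parts.length != 2) return;
      String docId = parts[0];
      for (String token : parts[1].toLowerCase().split("\\s+")) {
        if (token.length() > 0) out.collect(new Text(token), new Text(docId));
      }
    }
  }

  // Reduce: every document mentioning the same phrase arrives together,
  // so duplicated references across documents fall out directly.
  public static class ReferenceReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text phrase, Iterator<Text> docIds,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      Set<String> docs = new HashSet<String>();
      while (docIds.hasNext()) docs.add(docIds.next().toString());
      if (docs.size() > 1) out.collect(phrase, new Text(docs.toString()));
    }
  }
}

The second job touches far less data than the per-document parse, which is why the secondary pass is so much cheaper than the original analysis.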
