Hi all,

recently I've been playing (and coding) with BSP-based [1] algorithms using Apache Hama [2] (which officially graduated to TLP yesterday), and in many cases I found significant performance improvements over equivalent "plain" MapReduce-based algorithms, so I thought it would make sense to write a UIMA collection processing algorithm on top of Hama.
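For anyone not familiar with the BSP model, here is a toy sketch of the superstep idea (local computation, communication of partial results, barrier synchronization) using plain Java threads and a CyclicBarrier. This is NOT Hama's API (Hama exposes this via BSPPeer's send/sync methods), just a minimal, self-contained illustration of the execution model:

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Toy BSP-style parallel sum: each "peer" runs a local-computation
// superstep, publishes its partial result (the "message"), then all
// peers meet at a barrier before the aggregation superstep.
public class BspSketch {
    static final int PEERS = 4;

    public static int parallelSum(int[] data) throws Exception {
        int chunk = data.length / PEERS; // assume length divisible by PEERS
        AtomicIntegerArray partials = new AtomicIntegerArray(PEERS);
        CyclicBarrier barrier = new CyclicBarrier(PEERS);
        Thread[] threads = new Thread[PEERS];
        int[] total = new int[1];
        for (int p = 0; p < PEERS; p++) {
            final int peer = p;
            threads[p] = new Thread(() -> {
                // Superstep 1: local computation on this peer's partition
                int sum = 0;
                for (int i = peer * chunk; i < (peer + 1) * chunk; i++) {
                    sum += data[i];
                }
                partials.set(peer, sum); // "send" the partial result
                try {
                    barrier.await(); // barrier synchronization
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
                // Superstep 2: one peer aggregates all partial results
                if (peer == 0) {
                    int t = 0;
                    for (int q = 0; q < PEERS; q++) t += partials.get(q);
                    total[0] = t;
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) t.join();
        return total[0];
    }

    public static void main(String[] args) throws Exception {
        int[] data = new int[16];
        for (int i = 0; i < 16; i++) data[i] = i + 1; // values 1..16
        System.out.println(parallelSum(data)); // prints 136
    }
}
```

In Hama the barrier is global across tasks on different machines rather than threads in one JVM, which is what makes the model attractive for iterative algorithms compared to chaining MapReduce jobs.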
I started sketching it in a sample project on GitHub [3], but I think it would make sense to put it in our sandbox so that anyone can have a look at it, use it, improve it, or evaluate it. The current implementation just reads files from a directory on the local filesystem, processes them in parallel, and collects the ProcessTraces in an output file, but my idea is that it could become a new CPM implementation reading from and writing to HDFS.

I know that's a lot of things in a few lines, so feel free to ask for more clarifications.

Have a nice day,
Tommaso

[1] : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
[2] : http://incubator.apache.org/hama
[3] : https://github.com/tteofili/samplett/blob/master/uima-bsp/src/main/java/com/github/samplett/uima/bsp/AEProcessingBSPJob.java
