Hi Tommaso,

As I understand it, each CAS is processed independently, with no parallelization within the processing of a single CAS, right? If so, what you are doing does not look much like MapReduce (since there is no reduce step); it is closer to running many parallel instances on subsets of the collection.

We are currently using Sun Grid Engine to launch CPE instances on several nodes, getting the input data (in plain text or XMI format) from a MySQL database and writing XMI output to the DB. That way we avoid synchronization issues and can distribute data between instances with the simple modulo trick in the SELECT query.
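
To make the modulo trick concrete, here is a minimal sketch rather than our actual code. The table name `documents`, the columns `id` and `content`, and all class and method names are made up for illustration. Each instance gets its own index and the total number of instances, and selects only the rows whose ID leaves that remainder. The Java predicate below mirrors the WHERE clause, so the partitioning can be checked without a database.

```java
import java.util.ArrayList;
import java.util.List;

public class ModuloPartition {
    // Hypothetical table and column names. The first placeholder is the number
    // of instances, the second is the index of the instance running the query.
    static final String QUERY =
        "SELECT id, content FROM documents WHERE MOD(id, ?) = ?";

    // Same predicate as the WHERE clause, evaluated in Java for illustration.
    // Assumes non-negative IDs, as with a typical auto-increment key.
    static boolean belongsTo(long id, int instanceIndex, int instanceCount) {
        return id % instanceCount == instanceIndex;
    }

    // Returns the IDs that the given instance would receive from the query.
    static List<Long> idsFor(int instanceIndex, int instanceCount, List<Long> allIds) {
        List<Long> mine = new ArrayList<>();
        for (long id : allIds) {
            if (belongsTo(id, instanceIndex, instanceCount)) {
                mine.add(id);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        List<Long> allIds = new ArrayList<>();
        for (long id = 1; id <= 10; id++) {
            allIds.add(id);
        }
        int instanceCount = 3;
        for (int i = 0; i < instanceCount; i++) {
            System.out.println("instance " + i + " -> " + idsFor(i, instanceCount, allIds));
        }
    }
}
```

With IDs 1 to 10 and three instances, instance 0 gets 3, 6, 9; instance 1 gets 1, 4, 7, 10; and instance 2 gets 2, 5, 8. Every row goes to exactly one instance, so the instances never need to coordinate with each other.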

We also tried UIMA AS, but the overhead seemed very large. It might not be too bad with fully colocated aggregates, each working on one CAS from beginning to end; then we would need just one central CollectionReader that dispatches to the different aggregates. You don't seem to parallelize within the processing flow, so that's quite close to what your example does, isn't it?
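
The dispatching idea can be sketched in plain Java, without any UIMA or UIMA AS classes. This is only a rough illustration under assumptions: the `Iterable` stands in for the central CollectionReader, the `Function` stands in for a colocated aggregate that handles one item from beginning to end, and all names are hypothetical. In a real setup each worker would presumably need its own analysis engine instance, whereas here the function is assumed to be thread-safe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class CentralDispatcher {
    // Reads items one by one from a single source and hands each item to a
    // worker, which runs the whole pipeline on it. Results come back in
    // reading order.
    static <I, O> List<O> dispatch(Iterable<I> reader, Function<I, O> pipeline, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<O>> pending = new ArrayList<>();
            for (I item : reader) {
                // One task per item: no parallelism inside the pipeline itself.
                Callable<O> task = () -> pipeline.apply(item);
                pending.add(pool.submit(task));
            }
            List<O> results = new ArrayList<>();
            for (Future<O> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("pipeline failed on an item", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> documents = List.of("first text", "second, longer text", "third");
        // Stand-in pipeline: just computes the length of each document.
        List<Integer> lengths = dispatch(documents, String::length, 2);
        System.out.println(lengths);
    }
}
```

The point of the sketch is that parallelism exists only across items, never within the processing of one item, which is what makes the setup comparable to running independent instances on subsets of the collection.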

Bye,
Jens

On 05/17/2012 09:25 AM, Tommaso Teofili wrote:
Hi all,

recently I've been playing (and coding) with BSP [1] based algorithms using
Apache Hama [2] (which officially graduated to TLP yesterday) and I found
that in many cases there were significant performance boosts with respect
to a "plain" MapReduce based algorithm, so I thought it would make
sense to write a UIMA collection processing algorithm using Hama.

I started sketching it up on a sample project on GitHub [3] but I think it
would make sense to put it on our sandbox so that anyone can have a
look/use/improve/evaluate it.
The current implementation just reads files from a directory on
the filesystem, processes them in parallel and collects the ProcessTraces
in an output file, but my idea is that it could become a new CPM
implementation reading from and writing to HDFS.
I know that's a lot of things in a few lines, so feel free to ask for
clarifications.

Have a nice day,
Tommaso

[1] : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
[2] : http://incubator.apache.org/hama
[3] : https://github.com/tteofili/samplett/blob/master/uima-bsp/src/main/java/com/github/samplett/uima/bsp/AEProcessingBSPJob.java


