Hi Tommaso,
as I understand it, each CAS is processed independently, with no
parallelization within the processing of a single CAS, right? If so,
what you are doing does not look much like MapReduce (since you don't
reduce) but is closer to just running many parallel instances on
subsets of the collection.
We are currently using Sun Grid Engine to launch CPE instances on
several nodes, getting the input data (in plain text or XMI format) from
a MySQL database and writing XMI output to the DB. That way we avoid
synchronization issues and can distribute data between instances with
the simple modulo trick in the SELECT query.
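Roughly, and only as an illustration, the modulo trick looks like the
sketch below. The table name `documents`, the columns `id` and `text`,
and both function names are made up for the example and are not our real
schema. It assumes each instance knows its own 0-based index and the
total number of instances.

```python
def build_select(instance_index, instance_count):
    """Return (sql, params) selecting only this instance's share of rows."""
    if not 0 <= instance_index < instance_count:
        raise ValueError("instance_index must be in [0, instance_count)")
    # Hypothetical schema: documents(id INT PRIMARY KEY, text LONGTEXT).
    # MOD() is used instead of the % operator so the SQL does not clash
    # with the %s placeholders used by common MySQL drivers.
    sql = "SELECT id, text FROM documents WHERE MOD(id, %s) = %s"
    return sql, (instance_count, instance_index)


def assigned_ids(all_ids, instance_index, instance_count):
    """Pure-Python equivalent of the WHERE clause, handy for checking."""
    return [i for i in all_ids if i % instance_count == instance_index]
```

Since every id falls into exactly one residue class, no two instances
ever read the same row and no coordination between them is needed.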
We also tried using UIMA AS, but the overhead seemed very big. Maybe
with fully colocated aggregates, each working on one CAS from beginning
to end, it wouldn't be too bad; then we would just have one central
CollectionReader that dispatches to the different aggregates.
You don't seem to parallelize within the processing flow, so that's
quite close to what your example does, isn't it?
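To make the dispatching idea concrete, here is a toy sketch in plain
Python with no UIMA calls at all. `run_pipeline` and `dispatch` are
hypothetical names: the first stands in for one colocated aggregate that
handles a document from beginning to end, the second for the central
reader. The two "annotators" inside are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(doc):
    """Stand-in for one colocated aggregate: every step runs in one worker."""
    tokens = doc.split()                  # placeholder annotator 1
    upper = [t.upper() for t in tokens]   # placeholder annotator 2
    return {"text": doc, "tokens": upper}


def dispatch(reader, workers=4):
    """Central reader hands each document to a free worker.

    There is no parallelism inside the processing of one document;
    results come back in the order the reader produced the documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, reader))
```

Calling `dispatch(iter(["a b", "c"]), workers=2)` processes the two
documents on separate workers, each one start to finish.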
Bye,
Jens
On 05/17/2012 09:25 AM, Tommaso Teofili wrote:
Hi all,
recently I've been playing (and coding) with BSP [1] based algorithms using
Apache Hama [2] (which officially graduated to TLP yesterday) and I found
that in many cases there were significant performance boosts with respect
to a "plain" MapReduce based algorithm, so I thought it would make
sense to write a UIMA collection processing algorithm using Hama.
I started sketching it up on a sample project on GitHub [3] but I think it
would make sense to put it on our sandbox so that anyone can have a
look/use/improve/evaluate it.
The current implementation just reads files from a directory in the
filesystem, processes them in parallel and collects the ProcessTraces
in an output file, but my idea is that it could become simply a new CPM
implementation that reads from and writes to HDFS.
I know that's a lot to cover in a few lines, so feel free to ask for
clarification.
Have a nice day,
Tommaso
[1] : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
[2] : http://incubator.apache.org/hama
[3] : https://github.com/tteofili/samplett/blob/master/uima-bsp/src/main/java/com/github/samplett/uima/bsp/AEProcessingBSPJob.java