Hi Jens,

2012/5/17 Jens Grivolla <[email protected]>

> Hi Tommaso,
>
> as I understand it each CAS is processed independently and without
> parallelization, right? If so, what you are doing does not look that much
> like MapReduce (since you don't reduce) but is closer to just running many
> parallel instances on subsets of the collection.
>

there is no Reduce phase as I didn't want to implement a MR algorithm, but
just a BSP one (the only "reduce-like" phase in that implementation is the
trivial one where the ProcessTraces are collected and written to file).
BTW: you may find the BSP vs MapReduce paper [1] interesting.

>
> We are currently using Sun Grid Engine to launch CPE instances on several
> nodes, getting the input data (in plain text or XMI format) from a MySQL
> database and writing XMI output to the DB. That way we avoid
> synchronization issues and can distribute data between instances with the
> simple modulo trick in the SELECT query.
>

that sounds like a good approach as well.

>
> We also tried using UIMA AS, but the overhead seemed very big. Maybe by
> just having fully colocated aggregates, each working on one CAS from
> beginning to end it wouldn't be too bad, then we would just have one
> central CollectionReader that dispatches to the different aggregates. You
> don't seem to parallelize within the processing flow, so that's quite close
> to what your example does, isn't it?
>

Yes, it is, my sample implementation works like this:

1. first superstep:
   1a. computation: each node instantiates and initializes an AE
   1b. communication: a "master" node sends the docs to the other nodes
2. second superstep:
   2a. computation: each node processes all the docs it received with the AE
   2b. communication: each node sends the processing results to the
       "master" node
3. third superstep:
   3a. computation: the "master" collects the received results and all the
       nodes deallocate resources
   3b. communication: nothing

The good thing is that this can be run locally with multi core processors
to execute #docs%#processors parallel processes but, once the input is
placed in HDFS rather than a local directory, also with remote machines
without changing a single line of code (just configuring the Hama cluster,
see [2]).
For example, on my computer each of the 4 CPUs runs a different job, so
during the analysis phase 4 documents/CASes at a time are processed in
parallel.

However this is just a sample implementation which is not supposed to be
production ready; it may be a starting point for exploring whether we can
use this BSP based model to create new implementations of CPE/CPM and
compare them with other UIMA distributed architectures (UIMA-AS, your SGE
solution, etc.).

Cheers,
Tommaso

[1] : http://arxiv.org/pdf/1203.2081.pdf
[2] : http://wiki.apache.org/hama/GettingStarted#Modes

> Bye,
> Jens
>
>
> On 05/17/2012 09:25 AM, Tommaso Teofili wrote:
>
>> Hi all,
>>
>> recently I've been playing (and coding) with BSP [1] based algorithms
>> using Apache Hama [2] (which officially graduated to TLP yesterday) and
>> I found that in many cases there were significant performance boosts
>> with respect to a "plain" MapReduce based algorithm, so I thought it
>> would make sense to write a UIMA collection processing algorithm using
>> Hama.
>>
>> I started sketching it up on a sample project on GitHub [3] but I think it
>> would make sense to put it on our sandbox so that anyone can have a
>> look/use/improve/evaluate it.
>> The current implementation I have just reads files from a directory inside
>> the filesystem, processes them in parallel and collects the ProcessTraces
>> inside an output file but my idea is that it may come just as a new CPM
>> implementation reading and writing from/to HDFS.
>> I know it's a lot of things in few lines so feel free to ask for more
>> clarifications.
>>
>> Have a nice day,
>> Tommaso
>>
>> [1] : http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
>> [2] : http://incubator.apache.org/hama
>> [3] : https://github.com/tteofili/samplett/blob/master/uima-bsp/src/main/java/com/github/samplett/uima/bsp/AEProcessingBSPJob.java
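
For readers who want to see the three supersteps from the reply above in
code, here is a minimal sketch in plain Java. It does not use the Hama or
UIMA APIs and is not the code from the GitHub sample: threads stand in for
BSP peers, a CyclicBarrier stands in for the synchronization barrier between
supersteps, and a simple string function stands in for the AE. The class
name BspSuperstepSketch, the run method, the round-robin distribution
(document index modulo number of peers) and the fact that the "master" also
processes a share of the documents are all assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.function.Function;

// Hypothetical illustration only: no Hama or UIMA classes are involved.
final class BspSuperstepSketch {

    /** Runs the three supersteps on `peers` threads; peer 0 plays the "master". */
    static List<String> run(List<String> docs, int peers) {
        // One inbox per peer, standing in for the BSP message queues.
        List<List<String>> inboxes = new ArrayList<>();
        for (int i = 0; i < peers; i++) {
            inboxes.add(Collections.synchronizedList(new ArrayList<>()));
        }
        // Inbox of the master, filled by all peers in superstep 2b.
        List<String> masterResults = Collections.synchronizedList(new ArrayList<>());
        // All peers must reach the barrier before the next superstep starts.
        CyclicBarrier barrier = new CyclicBarrier(peers);

        Thread[] threads = new Thread[peers];
        for (int p = 0; p < peers; p++) {
            final int peer = p;
            threads[p] = new Thread(() -> {
                try {
                    // 1a. computation: each peer sets up its stand-in "AE".
                    Function<String, String> engine = doc -> "processed:" + doc;
                    // 1b. communication: the master hands out the docs
                    // round-robin (assumed distribution strategy).
                    if (peer == 0) {
                        for (int i = 0; i < docs.size(); i++) {
                            inboxes.get(i % peers).add(docs.get(i));
                        }
                    }
                    barrier.await(); // end of first superstep

                    // 2a. computation: process every doc received.
                    // 2b. communication: send each result to the master.
                    for (String doc : inboxes.get(peer)) {
                        masterResults.add(engine.apply(doc));
                    }
                    barrier.await(); // end of second superstep

                    // 3a. computation: the master would now write out the
                    // collected results; every peer releases its resources.
                    // 3b. communication: nothing.
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for peers", e);
            }
        }
        // Arrival order depends on thread scheduling, so sort for a stable result.
        List<String> sorted = new ArrayList<>(masterResults);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a.txt", "b.txt", "c.txt", "d.txt", "e.txt"), 4));
    }
}
```

With 4 peers and 5 documents, one peer receives two documents and the other
three receive one each, which roughly mirrors the "4 documents/CASes at a
time" behaviour described for a 4-CPU machine. In a real Hama job the
barrier and the message passing would be provided by the framework rather
than by shared in-memory lists.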
