It's been a while, but just to update in case anyone is watching this: My goal was to create a project full of annotators (both cTAKES and home-grown), and "cherry-pick" from them at will to create smaller pipelines that could be launched on a Hadoop grid via MapReduce.
My final setup consisted of two Maven aggregator projects, Annotators and Pipelines.

Annotators is an aggregator project containing all of the annotators and their resources. I am essentially following the cTAKES layout for this one: one annotator, one module. E.g.:

Annotators
  -ctakes-core-annotator
     Pom.xml
  -ctakes-pos-tagger-annotator
     Pom.xml
  -custom-annotator-one
     Pom.xml
  ParentPom.xml

Pipelines is another aggregator project containing the source code to generate the pipelines, and the job files that utilize the pipelines on the Hadoop grid (effectively serving as the input reader & CAS consumer). Each pipeline is its own Maven module, and spits out a .jar that contains all of the classes I need to run a UIMA-MapReduce job for that specific pipeline. It also creates a resource archive (model files, etc.) that I ship off to the Hadoop DistributedCache. E.g.:

Pipelines
  -custom-base-pipeline
     Pom.xml
  -observation-pipeline
     Pom.xml
  ParentPom.xml

Notes:
-I modified the cTAKES pom to put all of the descriptors into each individual annotator jar as well as the classes, just so they can conveniently be called by name. The "heavier" resources are put on the DistributedCache.
-I create individual pipeline distributions in the Pipelines project by using Maven's reactor options at the parent project level. E.g. "mvn package -pl custom-base-pipeline -am". This builds custom-base-pipeline together with all of the modules it depends on, and all of the necessary resources.
-Each pipeline has its own Maven assembly to specify what should be included with that pipeline's distribution and resources.

The point of this was to maximize modularity, pipeline flexibility, and runtime speed, and to keep my pipeline jars as lightweight as possible. Though cTAKES has many awesome features, I did not want to run every part of it every time.
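In case it helps anyone, here is a rough sketch of what a per-pipeline assembly descriptor along these lines could look like. It is not my exact file: the file path, the "pipeline" id, and the "**/models/**" exclude pattern are placeholders I made up for illustration, and the exclude assumes the heavy model files live under a "models" directory inside the annotator jars. It would be referenced from the pipeline module's maven-assembly-plugin configuration.

```xml
<!-- Hypothetical location: custom-base-pipeline/src/assembly/pipeline.xml -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <!-- The id becomes the classifier of the generated jar (placeholder name) -->
  <id>pipeline</id>
  <formats>
    <format>jar</format>
  </formats>
  <!-- Put classes at the jar root instead of under a top-level folder -->
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <outputDirectory>/</outputDirectory>
      <!-- Include this module's own classes plus its runtime dependencies -->
      <useProjectArtifact>true</useProjectArtifact>
      <scope>runtime</scope>
      <!-- Unpack everything into one flat jar -->
      <unpack>true</unpack>
      <unpackOptions>
        <excludes>
          <!-- Assumed pattern: leave heavy resources out of the jar so they
               can be shipped separately to the DistributedCache -->
          <exclude>**/models/**</exclude>
        </excludes>
      </unpackOptions>
    </dependencySet>
  </dependencySets>
</assembly>
```

A second descriptor with a zip or tar.gz format and the opposite include pattern would be one way to produce the resource archive, but treat that as a suggestion rather than a description of my build.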
Cheers,
Rob

On 9/9/13 11:23 AM, "Robert Spurrier" <robert.spurr...@explorys.com> wrote:

>Actually after poking around in Maven documentation I think I have just
>figured out an approach I like.
>
>For each pipeline I wish to create, I will generate a Maven assembly
>descriptor. I will put each assembly file in the cTAKES root pom.xml.
>Hopefully this will create each pipeline for me when I run 'package'. This
>approach will still tie in nicely with the project object model/lifecycle
>of cTAKES, and generate all my custom jars as well.
>
>I will try it out and update this thread with the results
>
>Thanks,
>Rob
>
>
>On 9/9/13 10:38 AM, "Chen, Pei" <pei.c...@childrens.harvard.edu> wrote:
>
>>Hi Robert,
>>
>>Are you planning to a process to build everything from source?
>>Or were you planning to have a build process that combines the ctakes-***
>>jars with your custom application jars?
>>
>>--Pei
>>
>>> -----Original Message-----
>>> From: Robert Spurrier [mailto:robert.spurr...@explorys.com]
>>> Sent: Monday, September 09, 2013 9:27 AM
>>> To: dev@ctakes.apache.org
>>> Subject: Creating Runnable .JARs From A Subset of cTAKES Maven Modules
>>>
>>> Good Morning!
>>>
>>> I am trying to use cTAKES tools on a distributed computing platform. I
>>> would rather not ship the entire compiled cTAKES package (~1.5 Gb) out
>>> to the shared cache when I only need a few annotators and their
>>> resources at a time.
>>>
>>> I should first mention that I am not very familiar with Maven. I
>>> recently upgraded cTAKES from v 2.5.0, where I was configuring smaller
>>> pipelines using ant build files. This process was cumbersome however,
>>> and I can appreciate the new modular Maven project layout. I just do
>>> not know how to effectively utilize it in a way that is flexible.
>>>
>>> Does anyone have any advice on how I can package subsets of cTAKES
>>> annotator modules and their dependencies/resources, so I can create
>>> 'thinner' custom pipelines that are geared towards specific tasks?
>>>
>>> For example, I might ultimately want a pipeline .JAR that contains the
>>> tools to RegEx Left Ventricular Ejection Fraction measurements from
>>> free text. In such a .JAR I would not need any of the dictionary
>>> resources or negation annotators, so they could be excluded.
>>>
>>> It looks like I could create Maven assembly plugin descriptors to
>>> generate these custom .JARs, but I would like to see if anyone here
>>> has any advice/caveats before I pursue this route.
>>>
>>>
>>> Thanks,
>>> Robert Spurrier