Re: Creating Runnable .JARs From A Subset of cTAKES Maven Modules

Robert Spurrier Tue, 01 Oct 2013 11:07:48 -0700

It's been a while, but just to update in case anyone is watching this:

My goal was to create a project full of annotators (both cTAKES and
home-grown) and "cherry-pick" from them at will to create smaller
pipelines that could be launched on a Hadoop grid via MapReduce.


My final setup consisted of two Maven aggregator projects, Annotators and
Pipelines.

Annotators is an aggregator project containing all of the annotators and
their resources.  I am essentially following the cTAKES layout for this
one. One annotator, one module.
E.g.:
Annotators
        -ctakes-core-annotator
                Pom.xml
        -ctakes-pos-tagger-annotator
                Pom.xml
        -custom-annotator-one
                Pom.xml
ParentPom.xml
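
For readers new to Maven aggregators, the parent POM for a layout like the
one above might look roughly like the sketch below. The groupId and version
are placeholders assumed for illustration; only the module names come from
the listing above.

```xml
<!-- Hypothetical aggregator POM for the Annotators project.
     groupId and version are assumed placeholders, not the real values. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example.nlp</groupId>
  <artifactId>annotators</artifactId>
  <version>1.0-SNAPSHOT</version>

  <!-- "pom" packaging marks this project as a parent/aggregator -->
  <packaging>pom</packaging>

  <!-- One annotator, one module -->
  <modules>
    <module>ctakes-core-annotator</module>
    <module>ctakes-pos-tagger-annotator</module>
    <module>custom-annotator-one</module>
  </modules>
</project>
```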


Pipelines is another aggregator project containing the source code to
generate the pipelines, and the job files that utilize the pipelines on
the Hadoop grid (effectively serving as the input reader & CAS consumer).
Each pipeline is its own Maven module, and spits out a .jar that contains
all of the classes I need to run a UIMA-MapReduce job for that specific
pipeline. It also creates a resource archive (model files, etc.) that I
ship off to the Hadoop DistributedCache.
E.g.:
Pipelines
        -custom-base-pipeline
                Pom.xml
        -observation-pipeline
                Pom.xml
ParentPom.xml
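
As a rough illustration of how one pipeline module could pull in only the
annotators it needs, here is a sketch of a module POM. The groupId, the
version, the choice of annotators, and the descriptor path are all
assumptions made for the example, not details of the actual project.

```xml
<!-- Hypothetical POM for a single pipeline module.
     Coordinates, dependencies, and the descriptor path are assumed. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>com.example.nlp</groupId>
    <artifactId>pipelines</artifactId>
    <version>1.0-SNAPSHOT</version>
  </parent>

  <artifactId>observation-pipeline</artifactId>

  <!-- Depend only on the annotator modules this pipeline actually uses -->
  <dependencies>
    <dependency>
      <groupId>com.example.nlp</groupId>
      <artifactId>ctakes-core-annotator</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>com.example.nlp</groupId>
      <artifactId>custom-annotator-one</artifactId>
      <version>1.0-SNAPSHOT</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <!-- Run the module's own assembly descriptor during "package" -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptors>
            <descriptor>src/main/assembly/pipeline.xml</descriptor>
          </descriptors>
        </configuration>
        <executions>
          <execution>
            <id>make-pipeline</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
```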



Notes:
-I modified the cTAKES pom to put all of the descriptors into each
individual annotator jar as well as the classes, just so they can
conveniently be called by name. The "heavier" resources are put on the
DistributedCache.

-I create individual pipeline distributions in the Pipelines project by
using Maven's reactor options at the parent project level, e.g. "mvn
package -pl custom-base-pipeline -am". This builds custom-base-pipeline
with all of its dependencies, and all of the necessary resources.

-Each pipeline has its own Maven assembly to specify what should be
included with that pipeline's distribution and resources.
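
To make the per-pipeline assembly idea concrete, a descriptor along these
lines could bundle a pipeline module and its runtime dependencies into a
single jar. This is only a sketch: the id and the decision to unpack every
dependency into one jar are assumptions, and the actual descriptors used
here may differ.

```xml
<!-- Hypothetical assembly descriptor for one pipeline module.
     The id and the "unpack everything" approach are assumed. -->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.2 http://maven.apache.org/xsd/assembly-1.1.2.xsd">
  <id>pipeline</id>

  <formats>
    <format>jar</format>
  </formats>

  <!-- Put classes at the jar root rather than under a top-level folder -->
  <includeBaseDirectory>false</includeBaseDirectory>

  <dependencySets>
    <dependencySet>
      <outputDirectory>/</outputDirectory>
      <!-- Include this module's own classes plus its runtime dependencies -->
      <useProjectArtifact>true</useProjectArtifact>
      <unpack>true</unpack>
      <scope>runtime</scope>
    </dependencySet>
  </dependencySets>
</assembly>
```

A second descriptor with a format such as tar.gz and a fileSets section
could produce the separate resource archive in the same way.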


The point of this was to maximize modularity, pipeline flexibility,
runtime speed, and to keep my pipeline jars as lightweight as possible.
Though it has many awesome features, I did not want to run every part of
cTAKES every time.


Cheers,
Rob

On 9/9/13 11:23 AM, "Robert Spurrier" <[email protected]> wrote:

>Actually after poking around in Maven documentation I think I have just
>figured out an approach I like.
>
>For each pipeline I wish to create, I will generate a Maven assembly
>descriptor. I will put each assembly file in the cTAKES root pom.xml.
>Hopefully this will create each pipeline for me when I run 'package'. This
>approach will still tie in nicely with the project object model/lifecycle
>of cTAKES, and generate all my custom jars as well.
>
>I will try it out and update this thread with the results
>
>Thanks,
>Rob
>
>
>On 9/9/13 10:38 AM, "Chen, Pei" <[email protected]> wrote:
>
>>Hi Robert,
>>
>>Are you planning a process to build everything from source?
>>Or were you planning to have a build process that combines the ctakes-***
>>jars with your custom application jars?
>>
>>--Pei
>>
>>> -----Original Message-----
>>> From: Robert Spurrier [mailto:[email protected]]
>>> Sent: Monday, September 09, 2013 9:27 AM
>>> To: [email protected]
>>> Subject: Creating Runnable .JARs From A Subset of cTAKES Maven Modules
>>>
>>> Good Morning!
>>>
>>> I am trying to use cTAKES tools on a distributed computing platform. I
>>> would rather not ship the entire compiled cTAKES package (~1.5 GB) out
>>> to the shared cache when I only need a few annotators and their
>>> resources at a time.
>>>
>>> I should first mention that I am not very familiar with Maven. I
>>> recently upgraded cTAKES from v 2.5.0, where I was configuring smaller
>>> pipelines using Ant build files. This process was cumbersome, however,
>>> and I can appreciate the new modular Maven project layout. I just do
>>> not know how to effectively utilize it in a way that is flexible.
>>>
>>> Does anyone have any advice on how I can package subsets of cTAKES
>>> annotator modules and their dependencies/resources, so I can create
>>> 'thinner' custom pipelines that are geared towards specific tasks?
>>>
>>> For example, I might ultimately want a pipeline .JAR that contains the
>>> tools to RegEx Left Ventricular Ejection Fraction measurements from
>>> free text. In such a .JAR I would not need any of the dictionary
>>> resources or negation annotators, so they could be excluded.
>>>
>>> It looks like I could create Maven assembly plugin descriptors to
>>> generate these custom .JARs, but I would like to see if anyone here
>>> has any advice/caveats before I pursue this route.
>>>
>>>
>>> Thanks,
>>> Robert Spurrier
>
>
