Hello all,
My name is Ben Holland, and I am a data scientist at Abt Associates. We are working to develop a scalable NLP engine and selected UIMA with OpenNLP as our technology stack. We also wanted to run this over Spark.
The huge draw for me was the excellent set of examples and documentation that UIMA provides, which let me get up and running easily. With that in mind, I am working with my company to put together code that I can give to the UIMA team, using only open source libraries (specifically UIMA, Hadoop, Spark, and OpenNLP). I want to provide you with a fully functional example developed in Eclipse. I will need a contact within the UIMA team at Apache. If someone could please get back to me on this, I would be most grateful.
The goal of this project is to fully mimic the CPE over a Spark cluster, using the UIMA XML descriptor files. I do not rely on uimaFIT or any third-party libraries apart from the JDBC driver. For bonus points, I hooked this up to a database: the example reads text from the database, populates N CAS objects with the database values, processes the text, and saves particularly interesting text back to the database. In this case, I pull out names.
Why am I coming to you? This is a very simple application. It really is a proof-of-concept example, but it is enough to get the architecture in place and to expand on it. I hope this interests you. I found it fascinating to work on.
BTW, you should all feel extremely proud of your work. I don't make these offers often, but the UIMA documentation, architecture, and code readability and stability are incredible. Within a few months, we were able to get an NLP engine into a process chain. I am very impressed.
Thank you all so much,
~Ben
