Hello everybody,

Having already presented the draft of our architecture, I would now like to discuss the second layer in more detail. As mentioned before, we have chosen UIMA for this layer. The main aggregate currently consists of the Whitespace Tokenizer Annotator, the Snowball Annotator (stemming) and a list-based StopwordFilter. Before running this aggregate in a map-only Hadoop job, we want to strip all HTML tags and forward only the preprocessed text to the aggregate. The reason is that it is difficult to modify a document during processing in UIMA, and it is impractical to work the whole time on documents that still contain HTML tags.
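To make the preprocessing step concrete, here is a minimal sketch of the HTML stripping we have in mind, done before the text ever reaches the UIMA aggregate. The class and method names (`HtmlStripper`, `stripHtml`) are hypothetical, not part of our pipeline yet, and a regex is only sufficient for reasonably well-formed markup; a real HTML parser would be more robust:

```java
// Hypothetical sketch: remove HTML tags before handing text to the aggregate.
public class HtmlStripper {

    // Replace each tag with a space, then collapse the whitespace left behind.
    // Assumes reasonably well-formed tags; broken markup would need a parser.
    public static String stripHtml(String html) {
        String noTags = html.replaceAll("<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String raw = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripHtml(raw)); // prints "Hello world"
    }
}
```

Replacing tags with a space (rather than the empty string) avoids gluing adjacent words together when tags sit between them.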

Furthermore, we are planning to add the Tagger Annotator, which implements a Hidden Markov Model tagger. We are not yet sure which tokens, based on their part-of-speech tags, to drop and which to keep for feature extraction. One option would be to start with only nouns and verbs.
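As a rough illustration of that nouns-and-verbs option, here is a simplified filter over (token, tag) pairs. This is only a sketch: the `Token` record is a stand-in for the annotations the tagger would write into the CAS, and it assumes Penn-Treebank-style tags, where noun tags start with "NN" and verb tags with "VB":

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: keep only nouns and verbs after POS tagging.
// In UIMA the (text, tag) pairs would come from annotations in the CAS.
public class PosFilter {

    record Token(String text, String pos) {}

    // Assumes Penn-Treebank-style tags: "NN*" for nouns, "VB*" for verbs.
    public static List<Token> keepNounsAndVerbs(List<Token> tokens) {
        return tokens.stream()
                .filter(t -> t.pos().startsWith("NN") || t.pos().startsWith("VB"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Token> tagged = List.of(
                new Token("The", "DT"),
                new Token("tagger", "NN"),
                new Token("annotates", "VBZ"),
                new Token("quickly", "RB"));
        // Keeps "tagger" (NN) and "annotates" (VBZ), drops the rest.
        keepNounsAndVerbs(tagged).forEach(t -> System.out.println(t.text()));
    }
}
```

Whether determiners, adjectives or adverbs should also survive into feature extraction is exactly the open question above.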

We are very interested in your comments and remarks and would be glad to hear from you.

Cheers,
Marc