That's a good comparison. Based on this, I believe we will have issues implementing functions 2 & 3 using OpenNLP. Can you evaluate the other functions as well?
Suho

On Thu, Aug 21, 2014 at 9:54 AM, Chanuka Dissanayake <[email protected]> wrote:

> We did a study on both the OpenNLP and Stanford NLP libraries and looked
> at the features that could support our implementation.
> Our findings are summarised below.
>
> It seems that Stanford NLP has better capabilities when considering
> support for regular expressions and parsing.
> We would like to discuss this further and choose the appropriate
>
> Feature: Named Entity Recognizer
>   OpenNLP: Identifies person, location, organization, time, date, money
>     and percentage entities inside the given sentence, but the sentence
>     needs to be tokenized first.
>   StanfordNLP: Includes a 4 class model trained for CoNLL, a 7 class
>     model trained for MUC, and a 3 class model trained on both data sets
>     for the intersection of those class sets.
>       3 class: Location, Person, Organization
>       4 class: Location, Person, Organization, Misc
>       7 class: Time, Location, Organization, Person, Money, Percent, Date
>
> Feature: POS Tagger
>   OpenNLP: Identifies tags such as VP (Verb Phrase), NP (Noun Phrase),
>     JJ (Adjective), etc.
>       Input: Hi. How are you? This is Mike
>       Output: Hi_NNP How_WRB are_VBP you? _JJ This_DT is_VBZ Mike._NNP
>   StanfordNLP: Labels each token with its POS tag, such as noun, verb,
>     adjective, etc.
>
> Feature: Tokenizing
>   OpenNLP: Separates words on whitespace by default. Otherwise it can be
>     trained to tokenize with different options.
>   StanfordNLP: Can tokenize the text either by whitespace or as per the
>     options defined.
>
> Feature: Parsing
>   OpenNLP: Given a tokenized sentence, it constructs the tree structure.
>   StanfordNLP: Works out the grammatical structure of sentences as a
>     tree structure. The parser provides Stanford Dependencies as well.
>     They represent the grammatical relations between words in a sentence.
>     Dependencies are triplets: name of the relation, governor and
>     dependent.
>       Ex: Bell, based in Los Angeles, makes and distributes electronic,
>       computer and building products.
>       Dependency: nsubj(distributes-10, Bell-1)
>       This is like saying “the subject of distributes is Bell.”
>
> Feature: Sentence Detection
>   OpenNLP: Detects sentence boundaries given a paragraph.
>   StanfordNLP: Available as ssplit. Can split sentences as per the
>     options defined.
>
> Feature: Regular Expressions
>   OpenNLP: Character-wise regular expressions only. Cannot identify
>     named entities or PoS tags via regular expressions.
>   StanfordNLP: Two tools are provided to deal with regular expressions.
>     RegexNER: Can define simple rules with regular expressions and
>       label entities with NE labels that are not provided.
>       Ex: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
>       This rule labels tokens matching the regex in the first column
>       as DEGREE.
>     TokensRegex: Can identify patterns over a list of tokens. In
>       addition to Java regex matching, this provides syntax to match
>       part-of-speech tags, named entity tags and lemmas.
>       Ex: [ { tag:VBD } ], /University/ /of/ [{ ner:LOCATION }]
>
> Thanks,
> Chanuka.
>
>
> On Tue, Aug 19, 2014 at 11:11 PM, Sriskandarajah Suhothayan
> <[email protected]> wrote:
>
>> +1 looks good
>>
>> Suho
>>
>>
>> On Tue, Aug 19, 2014 at 9:56 PM, Srinath Perera <[email protected]> wrote:
>>
>>> Looks good. If possible, we should do this with OpenNLP, as it has the
>>> Apache licence. However, I could not find an NLP regex implementation
>>> there. Please look at it in detail.
>>>
>>> --Srinath
>>>
>>>
>>> On Tue, Aug 19, 2014 at 9:52 PM, Malithi Edirisinghe <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Hi All,
>>>>
>>>> We are working on an NLP Toolbox improvement in CEP. The main idea of
>>>> this improvement is to use an NLP library and let users perform some
>>>> NLP operations as Siddhi extensions.
>>>>
>>>> So in our implementation we have decided to support the following NLP
>>>> operations.
>>>>
>>>> *1. findNameEntityType(sentence, entityType)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence and a predefined entity type as its
>>>> inputs.
>>>> It will return noun(s) in the sentence that match the defined
>>>> entity type, as event(s).
>>>>
>>>> *inputs:*
>>>>
>>>> sentence : sentence to be processed
>>>> entityType : predefined entity type
>>>>     ORGANIZATION
>>>>     NAME
>>>>     LOCATION
>>>> *output:*
>>>>
>>>> matching noun(s) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>>     sentence : Alice works at WSO2
>>>>     entityType : NAME
>>>>
>>>> output: Alice
>>>>
>>>> *2. findNLRegexPattern(sentence, regex)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence and a regular expression as its
>>>> inputs. It will return each match in the sentence, as an event.
>>>>
>>>> *inputs:*
>>>>
>>>> sentence : sentence to be processed
>>>> regex : regular expression to be matched
>>>> *output:*
>>>>
>>>> matching phrase(s) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>>     sentence : WSO2 was found in 2005
>>>>     regex : \\d{4}
>>>>
>>>> output: 2005
>>>>
>>>> *3. findRelationship(sentence, regex)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence and a regular expression as its
>>>> inputs. For each relationship extracted using the regular expression,
>>>> the operation will return a triplet (subject, object and
>>>> relationship) as an event.
>>>>
>>>> *inputs:*
>>>>
>>>> sentence : sentence to be processed
>>>> regex : regular expression to extract the relationship
>>>> *output:*
>>>>
>>>> triplet(s) of (subject, object, relationship) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>>     sentence : Bob works for WSO2
>>>>     regex : works for
>>>>
>>>> output: (Bob, WSO2, works for)
>>>>
>>>> *4. findNameEntityTypeViaDictionary(sentence, dictionary, entityType)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence, a dictionary file and a predefined
>>>> entity type as its inputs. It will return noun(s) in the sentence of
>>>> the defined entity type that also exist in the dictionary, as event(s).
>>>>
>>>> *inputs:*
>>>>
>>>> sentence : sentence to be processed
>>>> dictionary : dictionary of entities of the defined entity type
>>>> entityType : predefined entity type
>>>>     ORGANIZATION
>>>>     NAME
>>>>     LOCATION
>>>> *output:*
>>>>
>>>> matching noun(s) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>>     sentence : Bob works at WSO2
>>>>     dictionary : (WSO2,ORACLE,IBM)
>>>>     entityType : ORGANIZATION
>>>>
>>>> output: WSO2
>>>>
>>>> Each NLP operation defined here will be implemented as a transformer
>>>> extension to Siddhi.
>>>> --
>>>>
>>>> *Malithi Edirisinghe*
>>>> Senior Software Engineer
>>>> WSO2 Inc.
>>>>
>>>> Mobile : +94 (0) 718176807
>>>> [email protected]
>>>>
>>>
>>>
>>> --
>>> ============================
>>> Director, Research, WSO2 Inc.
>>> Visiting Faculty, University of Moratuwa
>>> Member, Apache Software Foundation
>>> Research Scientist, Lanka Software Foundation
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>
>>
>>
>> --
>>
>> *S. Suhothayan*
>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>> *WSO2 Inc. *http://wso2.com
>> lean . enterprise . middleware
>>
>> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
>> twitter: http://twitter.com/suhothayan | linked-in:
>> http://lk.linkedin.com/in/suhothayan*
>>
>
>
> --
> Chanuka Dissanayake
> *Software Engineer | **WSO2 Inc.*; http://wso2.com
>
> Mobile: +94 71 33 63 596
> Email: [email protected]
>

--
*S. Suhothayan*
Technical Lead & Team Lead of WSO2 Complex Event Processor
*WSO2 Inc. *http://wso2.com
lean . enterprise . middleware

*cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
twitter: http://twitter.com/suhothayan | linked-in:
http://lk.linkedin.com/in/suhothayan*
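
For reference, the two operations flagged above (functions 2 and 3) can be sketched with plain character-wise regular expressions, which is the only regex support the comparison lists for OpenNLP. This is a minimal illustration using only java.util.regex, not the proposed Siddhi extension. The class name NlpRegexSketch and the rule "text before the match is the subject, text after it is the object" are assumptions made for this example; the method names simply mirror the operation names in the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Siddhi, OpenNLP or Stanford NLP.
final class NlpRegexSketch {

    // Function 2, character-wise: collect every regex match in the sentence.
    static List<String> findNLRegexPattern(String sentence, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(sentence);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    // Function 3, naive assumption: everything before the first match is the
    // subject and everything after it is the object. Returns an empty list if
    // the regex does not match, otherwise (subject, object, relationship).
    static List<String> findRelationship(String sentence, String regex) {
        Matcher m = Pattern.compile(regex).matcher(sentence);
        if (!m.find()) {
            return Collections.emptyList();
        }
        String subject = sentence.substring(0, m.start()).trim();
        String object = sentence.substring(m.end()).trim();
        return Arrays.asList(subject, object, m.group());
    }

    public static void main(String[] args) {
        System.out.println(findNLRegexPattern("WSO2 was found in 2005", "\\d{4}"));
        System.out.println(findRelationship("Bob works for WSO2", "works for"));
    }
}
```

The sketch reproduces the two examples from the proposal, but only because those sentences are very short. It has no notion of tokens, part-of-speech tags or entity types, so a longer sentence would yield whole clauses as "subject" and "object". That gap is what token-level matching such as TokensRegex is meant to cover.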
_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
