Re: [Architecture] [CEP] NLP Toolbox

Sriskandarajah Suhothayan Wed, 20 Aug 2014 21:40:24 -0700

That's a good comparison.
Based on this, I believe we will have issues implementing functions 2 & 3
using OpenNLP.
Can you evaluate the other functions as well?


Suho


On Thu, Aug 21, 2014 at 9:54 AM, Chanuka Dissanayake <[email protected]>
wrote:

> We did a study on both OpenNLP and Stanford NLP libraries and looked at
> the features that could support our implementation.
> Our findings are summarised below.
>
> It seems that Stanford NLP has better capabilities when considering
> support for regular expressions and parsing.
> We would like to discuss this further and choose the appropriate library.
>
>
> Feature: Named Entity Recognizer
>   OpenNLP: Identifies person, location, organization, time, date, money
>     and percentage entities in the given sentence, but the sentence needs
>     to be tokenized first.
>   StanfordNLP: Includes a 4 class model trained for CoNLL, a 7 class
>     model trained for MUC, and a 3 class model trained on both data sets
>     for the intersection of those class sets.
>     3 class: Location, Person, Organization
>     4 class: Location, Person, Organization, Misc
>     7 class: Time, Location, Organization, Person, Money, Percent, Date
>
> Feature: POS Tagger
>   OpenNLP: Identifies VP (Verb Phrase), NP (Noun Phrase), JJ (Adjective),
>     etc.
>     Input: Hi. How are you? This is Mike
>     Output: Hi_NNP How_WRB are_VBP you? _JJ This_DT is_VBZ Mike._NNP
>   StanfordNLP: Labels each token with its POS tag, such as noun, verb,
>     adjective, etc.
>
> Feature: Tokenizing
>   OpenNLP: Separates words on whitespace by default. Otherwise it can be
>     trained to tokenize using different options.
>   StanfordNLP: Can tokenize the text either by whitespace or as per the
>     options defined.
>
> Feature: Parsing
>   OpenNLP: Given a tokenized sentence, it constructs the tree structure.
>   StanfordNLP: Works out the grammatical structure of sentences as a
>     tree. The parser provides Stanford Dependencies as well. They
>     represent the grammatical relations between words in a sentence.
>     Dependencies are triplets: name of the relation, governor and
>     dependent.
>     Ex: Bell, based in Los Angeles, makes and distributes electronic,
>     computer and building products.
>     Dependency: nsubj(distributes-10, Bell-1)
>     This is like saying "the subject of distributes is Bell."
>
> Feature: Sentence Detection
>   OpenNLP: Detects sentence boundaries given a paragraph.
>   StanfordNLP: Available as ssplit. Can split sentences as per the
>     options defined.
>
> Feature: Regular Expressions
>   OpenNLP: Character-wise regular expressions only. Cannot identify named
>     entities or POS tags via regular expressions.
>   StanfordNLP: Two tools are provided to deal with regular expressions.
>     RegexNER: Can define simple rules with regular expressions and label
>     entities with NE labels that the built-in models do not provide.
>     Ex: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
>     This rule labels tokens matching the regex in the first column as
>     DEGREE.
>     TokensRegex: Can identify patterns over a list of tokens. In addition
>     to Java regex matching, it provides syntax to match part of speech
>     tags, named entity tags and lemmas.
>     Ex: [ { tag:VBD } ], /University/ /of/ [{ ner:LOCATION }]
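
To make the RegexNER rule above a little more concrete, here is a rough Python sketch of the idea using only the standard `re` module. It is not the Stanford API: the `RULE` tuple and the `label_with_rule` function are made-up names for illustration, and the real tool matches token by token from a mapping file rather than over the raw string.

```python
import re

# Hypothetical stand-in for a single RegexNER rule: (pattern, label).
# Taken from the example rule: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
RULE = (r"Bachelor of (?:Arts|Laws|Science|Engineering)", "DEGREE")


def label_with_rule(sentence, rule=RULE):
    """Return (matched_text, label) pairs for every match of the rule."""
    pattern, label = rule
    return [(m.group(0), label) for m in re.finditer(pattern, sentence)]


if __name__ == "__main__":
    print(label_with_rule("She holds a Bachelor of Science from Moratuwa."))
```

A character-level regex like this can only see the text itself; matching on POS or NER tags, as TokensRegex does, needs the annotations that the library produces.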
>
>
> Thanks,
> Chanuka.
>
>
> On Tue, Aug 19, 2014 at 11:11 PM, Sriskandarajah Suhothayan <[email protected]
> > wrote:
>
>> +1 looks good
>>
>> Suho
>>
>>
>> On Tue, Aug 19, 2014 at 9:56 PM, Srinath Perera <[email protected]> wrote:
>>
>>> Looks good. If possible we should do this with OpenNLP as it has the
>>> Apache licence. However, I could not find an NLP regex implementation
>>> there. Please look at it in detail.
>>>
>>> --Srinath
>>>
>>>
>>> On Tue, Aug 19, 2014 at 9:52 PM, Malithi Edirisinghe <[email protected]>
>>> wrote:
>>>
>>>>
>>>> Hi All,
>>>>
>>>> We are working on an NLP Toolbox improvement in CEP. The main idea of
>>>> this improvement is to use an NLP library and let users perform NLP
>>>> operations as Siddhi extensions.
>>>>
>>>> In our implementation we have decided to support the following NLP
>>>> operations.
>>>>
>>>> *1. findNameEntityType(sentence, entityType)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence and a predefined entity type as its
>>>> inputs. It will return the noun(s) in the sentence that match the
>>>> defined entity type, as event(s).
>>>>
>>>> *inputs:*
>>>>
>>>> sentence   : sentence to be processed
>>>> entityType : predefined entity type, one of
>>>>              ORGANIZATION
>>>>              NAME
>>>>              LOCATION
>>>>
>>>> *output:*
>>>>
>>>> matching noun(s) as event(s)
>>>>
>>>> *example:*
>>>>
>>>>  inputs:
>>>> sentence   : Alice works at WSO2
>>>>  entityType : NAME
>>>>
>>>>  output: Alice
>>>>
>>>> *2. findNLRegexPattern(sentence, regex)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence and a regular expression as its
>>>> inputs. It will return each match in the sentence as an event.
>>>>
>>>> *inputs:*
>>>>
>>>> sentence : sentence to be processed
>>>> regex    : regular expression to be matched
>>>>
>>>> *output:*
>>>>
>>>> matching phrase(s) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>> sentence : WSO2 was founded in 2005
>>>> regex    : \\d{4}
>>>>
>>>> output: 2005
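
For reference, the character-level behaviour described for operation 2 can be sketched in a few lines of Python with the standard `re` module. This only illustrates the expected input and output; the function name `find_nl_regex_pattern` is hypothetical and this is not the Siddhi extension API. Matching over POS or entity tags would need library support.

```python
import re


def find_nl_regex_pattern(sentence, regex):
    """Return every non-overlapping match of regex in sentence, in order.

    In the proposed operation each returned match would become one event.
    """
    return [m.group(0) for m in re.finditer(regex, sentence)]


if __name__ == "__main__":
    # The proposal writes the pattern as \\d{4}, escaped as in a Java
    # string literal; a Python raw string needs only one backslash.
    print(find_nl_regex_pattern("WSO2 was founded in 2005", r"\d{4}"))
```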
>>>>
>>>> *3. findRelationship(sentence, regex)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence and a regular expression as its
>>>> inputs. For each relationship matched by the regular expression, the
>>>> operation will return a triplet (subject, object, relationship) as an
>>>> event.
>>>>
>>>> *inputs:*
>>>>
>>>> sentence : sentence to be processed
>>>> regex    : regular expression to extract the relationship
>>>>
>>>> *output:*
>>>>
>>>> triplet(s) of (subject, object, relationship) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>> sentence : Bob works for WSO2
>>>> regex    : works for
>>>>
>>>> output: (Bob, WSO2, works for)
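
As a rough illustration of the triplet output for operation 3, the Python sketch below takes the token immediately before each regex match as the subject and the token immediately after it as the object. That heuristic is an assumption made only for this example; the thread does not say how subject and object are found, and a real implementation would more likely rely on the parser or on dependencies. The name `find_relationship` simply mirrors the proposed operation.

```python
import re


def find_relationship(sentence, regex):
    """Return (subject, object, relationship) triplets for each regex match.

    Naive assumption: subject is the whitespace-delimited token right
    before the match, object is the token right after it. Matches that
    lack either neighbour are skipped.
    """
    triplets = []
    for m in re.finditer(regex, sentence):
        before = sentence[:m.start()].split()
        after = sentence[m.end():].split()
        if before and after:
            triplets.append((before[-1], after[0], m.group(0)))
    return triplets


if __name__ == "__main__":
    print(find_relationship("Bob works for WSO2", r"works for"))
```

Multi-word subjects or objects (for example "Bob Smith") are exactly where this heuristic breaks down and where parsing support from the library matters.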
>>>>
>>>> *4. findNameEntityTypeViaDictionary(sentence, dictionary, entityType)*
>>>>
>>>> *Description:*
>>>>
>>>> This operation takes a sentence, a dictionary file and a predefined
>>>> entity type as its inputs. It will return the noun(s) in the sentence
>>>> of the defined entity type that also exist in the dictionary, as
>>>> event(s).
>>>>
>>>> *inputs:*
>>>>
>>>> sentence   : sentence to be processed
>>>> dictionary : dictionary of entities of the defined entity type
>>>> entityType : predefined entity type, one of
>>>>              ORGANIZATION
>>>>              NAME
>>>>              LOCATION
>>>>
>>>> *output:*
>>>>
>>>> matching noun(s) as event(s)
>>>>
>>>> *example:*
>>>>
>>>> inputs:
>>>> sentence   : Bob works at WSO2
>>>> dictionary : (WSO2,ORACLE,IBM)
>>>> entityType : ORGANIZATION
>>>>
>>>> output: WSO2
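
The dictionary lookup in operation 4 can be sketched as below in Python. Two things here are assumptions for illustration only: the dictionary is modelled as an in-memory mapping from entity type to entries (the proposal mentions a dictionary file without giving its format), and the sketch does the dictionary check alone, without first running a named entity recognizer over the sentence.

```python
import re


def find_name_entity_type_via_dictionary(sentence, dictionary, entity_type):
    """Return the tokens of sentence listed in dictionary under entity_type.

    dictionary is assumed to map an entity type to a collection of entries.
    """
    entries = set(dictionary.get(entity_type, ()))
    tokens = re.findall(r"\w+", sentence)  # crude word tokenizer
    return [token for token in tokens if token in entries]


if __name__ == "__main__":
    dictionary = {"ORGANIZATION": ["WSO2", "ORACLE", "IBM"]}
    print(find_name_entity_type_via_dictionary(
        "Bob works at WSO2", dictionary, "ORGANIZATION"))
```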
>>>>
>>>> Each NLP operation defined here will be implemented as a transformer
>>>> extension to Siddhi.
>>>> --
>>>>
>>>> *Malithi Edirisinghe*
>>>> Senior Software Engineer
>>>> WSO2 Inc.
>>>>
>>>> Mobile : +94 (0) 718176807
>>>>  [email protected]
>>>>
>>>
>>>
>>>
>>> --
>>> ============================
>>> Director, Research, WSO2 Inc.
>>> Visiting Faculty, University of Moratuwa
>>> Member, Apache Software Foundation
>>> Research Scientist, Lanka Software Foundation
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>
>>
>>
>>
>> --
>>
>> *S. Suhothayan*
>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>> *WSO2 Inc.* http://wso2.com
>> lean . enterprise . middleware
>>
>>
>> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
>> twitter: http://twitter.com/suhothayan | linked-in:
>> http://lk.linkedin.com/in/suhothayan*
>>
>
>
>
> --
> Chanuka Dissanayake
> *Software Engineer | **WSO2 Inc.*; http://wso2.com
>
> Mobile: +94 71 33 63 596
> Email: [email protected]
>



-- 

*S. Suhothayan*
Technical Lead & Team Lead of WSO2 Complex Event Processor
*WSO2 Inc.* http://wso2.com
lean . enterprise . middleware


*cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
twitter: http://twitter.com/suhothayan | linked-in:
http://lk.linkedin.com/in/suhothayan*

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
