Re: [Architecture] [CEP] NLP Toolbox

Chanuka Dissanayake Wed, 20 Aug 2014 21:27:07 -0700

We did a study on both OpenNLP and Stanford NLP libraries and looked at the
features that could support our implementation.
Our findings are summarised below.

It seems that Stanford NLP has better capabilities when it comes to support
for regular expressions and parsing.
We would like to discuss this further and choose the appropriate library.

Feature: Named Entity Recognizer
  OpenNLP:
    Identifies person, location, organization, time, date, money and
    percentage entities inside the given sentence, but the sentence needs to
    be tokenized first.
  StanfordNLP:
    Includes a 4 class model trained for CoNLL, a 7 class model trained for
    MUC, and a 3 class model trained on both data sets for the intersection
    of those class sets.
    3 class: Location, Person, Organization
    4 class: Location, Person, Organization, Misc
    7 class: Time, Location, Organization, Person, Money, Percent, Date

Feature: POS Tagger
  OpenNLP:
    Identifies tags such as VP (Verb Phrase), NP (Noun Phrase),
    JJ (Adjective), etc.
    Input: Hi. How are you? This is Mike
    Output: Hi_NNP How_WRB are_VBP you?_JJ This_DT is_VBZ Mike._NNP
  StanfordNLP:
    Labels each token with its POS tag, such as noun, verb, adjective, etc.

Feature: Tokenizing
  OpenNLP:
    Separates words on whitespace by default. Otherwise it can be trained to
    tokenize with different options.
  StanfordNLP:
    Can tokenize the text either by whitespace or as per the options defined.

Feature: Parsing
  OpenNLP:
    Given a tokenized sentence, it constructs the tree structure.
  StanfordNLP:
    Works out the grammatical structure of sentences as a tree. The parser
    provides Stanford Dependencies as well. They represent the grammatical
    relations between words in a sentence. Dependencies are triplets: name of
    the relation, governor and dependent.
    Ex: Bell, based in Los Angeles, makes and distributes electronic,
    computer and building products.
    Dependency: nsubj(distributes-10, Bell-1)
    This is like saying "the subject of distributes is Bell."

Feature: Sentence Detection
  OpenNLP:
    Detects sentence boundaries given a paragraph.
  StanfordNLP:
    Available as ssplit. Can split sentences as per the options defined.

Feature: Regular Expressions
  OpenNLP:
    Character-wise regular expressions only. Cannot identify named entities
    or POS tags via regular expressions.
  StanfordNLP:
    Two tools are provided to deal with regular expressions.
    RegexNER: Can define simple rules with regular expressions and label
    entities with NE labels that the default models do not provide.
    Ex: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
    This rule labels tokens matching the regex in the first column as DEGREE.
    TokensRegex: Can identify patterns over a list of tokens. In addition to
    Java regex matching, it provides syntax to match part-of-speech tags,
    named entity tags and lemmas.
    Ex: [ { tag:VBD } ], /University/ /of/ [{ ner:LOCATION }]
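
To make the token-level point concrete, here is a minimal sketch of what
matching over annotated tokens (rather than over characters) could look like.
It is plain Python, used only for brevity, and it is not the TokensRegex API.
The helper names word, attr and find_token_matches, and the hand-annotated
TOKENS list, are assumptions made up for this illustration.

```python
import re


def word(regex):
    """Predicate: the token's text fully matches the regex (like /University/)."""
    compiled = re.compile(regex)
    return lambda tok: compiled.fullmatch(tok["word"]) is not None


def attr(name, value):
    """Predicate: a token annotation equals a value (like { ner:LOCATION })."""
    return lambda tok: tok.get(name) == value


def find_token_matches(tokens, pattern):
    """Slide a fixed-length list of predicates over the tokens.

    Returns the words of every window in which each token satisfies the
    predicate at the same position.
    """
    size = len(pattern)
    matches = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(pred(tok) for pred, tok in zip(pattern, window)):
            matches.append([tok["word"] for tok in window])
    return matches


# Hypothetical, hand-annotated tokens. In a real pipeline the tag and ner
# values would come from a POS tagger and a named entity recognizer.
TOKENS = [
    {"word": "She", "tag": "PRP", "ner": "O"},
    {"word": "attended", "tag": "VBD", "ner": "O"},
    {"word": "University", "tag": "NNP", "ner": "ORGANIZATION"},
    {"word": "of", "tag": "IN", "ner": "ORGANIZATION"},
    {"word": "Moratuwa", "tag": "NNP", "ner": "LOCATION"},
]

# Rough equivalent of: a VBD token, "University", "of", a LOCATION token.
PATTERN = [
    attr("tag", "VBD"),
    word("University"),
    word("of"),
    attr("ner", "LOCATION"),
]
```

With these inputs, find_token_matches(TOKENS, PATTERN) returns the single
window attended / University / of / Moratuwa. A character-level regex has no
way to express the VBD or LOCATION constraints, which is the gap noted for
OpenNLP above.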

Thanks,
Chanuka.

On Tue, Aug 19, 2014 at 11:11 PM, Sriskandarajah Suhothayan <[email protected]>
wrote:

> +1 looks good
>
> Suho
>
>
> On Tue, Aug 19, 2014 at 9:56 PM, Srinath Perera <[email protected]> wrote:
>
>> Looks good. If possible we should do this with OpenNLP as it has an Apache
>> licence. However, I could not find an NLP regex implementation there. Please
>> look at it in detail.
>>
>> --Srinath
>>
>>
>> On Tue, Aug 19, 2014 at 9:52 PM, Malithi Edirisinghe <[email protected]>
>> wrote:
>>
>>>
>>> Hi All,
>>>
>>> We are working on an NLP Toolbox improvement in CEP. The main idea of
>>> this improvement is to use an NLP library and let users perform NLP
>>> operations as Siddhi extensions.
>>>
>>> So in our implementation we have decided to support the following NLP
>>> operations.
>>>
>>> *1. findNameEntityType(sentence, entityType)*
>>>
>>> *Description:*
>>>
>>> This operation takes a sentence and a predefined entity type as its
>>> inputs. It will return the noun(s) in the sentence that match the defined
>>> entity type, as event(s).
>>>
>>> *inputs:*
>>>
>>> sentence  : sentence to be processed
>>> entityType: predefined entity type
>>>     ORGANIZATION
>>>     NAME
>>>     LOCATION
>>>
>>> *output:*
>>>
>>> matching noun(s) as event(s)
>>>
>>> *example:*
>>>
>>> inputs:
>>> sentence   : Alice works at WSO2
>>> entityType : NAME
>>>
>>> output: Alice
>>>
>>> *2. findNLRegexPattern(sentence, regex)*
>>>
>>> *Description:*
>>>
>>> This operation takes a sentence and a regular expression as its inputs.
>>> It will return each match in the sentence, as an event.
>>>
>>> *inputs:*
>>>
>>> sentence  : sentence to be processed
>>> regex     : regular expression to be matched
>>>
>>> *output:*
>>>
>>> matching phrase(s) as event(s)
>>>
>>> *example:*
>>>
>>> inputs:
>>> sentence   : WSO2 was founded in 2005
>>> regex      : \\d{4}
>>>
>>> output: 2005
>>>
>>> *3. findRelationship(sentence, regex)*
>>>
>>> *Description:*
>>>
>>> This operation takes a sentence and a regular expression as its inputs.
>>> For each relationship matched by the regular expression, the operation
>>> will return a triplet (subject, object, relationship) as an event.
>>>
>>> *inputs:*
>>>
>>> sentence  : sentence to be processed
>>> regex     : regular expression to extract the relationship
>>>
>>> *output:*
>>>
>>> triplet(s) of (subject, object, relationship) as event(s)
>>>
>>> *example:*
>>>
>>> inputs:
>>> sentence   : Bob works for WSO2
>>> regex      : works for
>>>
>>> output: (Bob, WSO2, works for)
>>>
>>> *4. findNameEntityTypeViaDictionary(sentence, dictionary, entityType)*
>>>
>>> *Description:*
>>>
>>> This operation takes a sentence, a dictionary file and a predefined entity
>>> type as its inputs. It will return the noun(s) in the sentence of the defined
>>> entity type that also exist in the dictionary, as event(s).
>>>
>>> *inputs:*
>>>
>>> sentence   : sentence to be processed
>>> dictionary : dictionary of entities of the defined entity type
>>> entityType : predefined entity type
>>>     ORGANIZATION
>>>     NAME
>>>     LOCATION
>>>
>>> *output:*
>>>
>>> matching noun(s) as event(s)
>>>
>>> *example:*
>>>
>>> inputs:
>>> sentence    : Bob works at WSO2
>>> dictionary  : (WSO2,ORACLE,IBM)
>>> entityType  : ORGANIZATION
>>>
>>> output: WSO2
>>>
>>> Each NLP operation defined here will be implemented as a transformer
>>> extension to Siddhi.
>>> --
>>>
>>> *Malithi Edirisinghe*
>>> Senior Software Engineer
>>> WSO2 Inc.
>>>
>>> Mobile : +94 (0) 718176807
>>>  [email protected]
>>>
>>
>>
>>
>> --
>> ============================
>> Director, Research, WSO2 Inc.
>> Visiting Faculty, University of Moratuwa
>> Member, Apache Software Foundation
>> Research Scientist, Lanka Software Foundation
>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>> Site: http://people.apache.org/~hemapani/
>> Photos: http://www.flickr.com/photos/hemapani/
>> Phone: 0772360902
>>
>
>
>
> --
>
> *S. Suhothayan*
> Technical Lead & Team Lead of WSO2 Complex Event Processor
> *WSO2 Inc.* http://wso2.com
> lean . enterprise . middleware
>
>
> *cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
> twitter: http://twitter.com/suhothayan |
> linked-in: http://lk.linkedin.com/in/suhothayan*
>

-- 
Chanuka Dissanayake
*Software Engineer | **WSO2 Inc.*; http://wso2.com

Mobile: +94 71 33 63 596
Email: [email protected]

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
