Re: [Architecture] [CEP] NLP Toolbox

Srinath Perera Mon, 01 Sep 2014 02:13:36 -0700

How about 2pm? (Someone had a conflict in the AM)


On Mon, Sep 1, 2014 at 2:40 PM, Srinath Perera <[email protected]> wrote:

> Can we meet and discuss? How about tomorrow 11am?
>
>
> On Thu, Aug 28, 2014 at 6:49 PM, Malithi Edirisinghe <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have looked in detail at how Stanford NLP extracts grammatical
>> dependencies and have the following concerns regarding the implementation
>> of the 3rd query (findRelationship(sentence, regex)).
>>
>> Given a sentence, Stanford NLP can recognise around 50 grammatical
>> relationships. I have listed some below with simple examples.
>>
>>
>>    - acomp: adjectival complement
>>
>> This is an adjectival phrase which functions as the complement (like an
>> object of the verb).
>>
>> ex:
>>
>> “She looks very beautiful” -> acomp(looks, beautiful)
>>
>>
>>    - agent
>>
>> This is a complement of a passive verb which is introduced by the
>> preposition “by” and does the action.
>>
>> ex:
>>
>> “The man has been killed by the police” -> agent(killed, police)
>> “Effects caused by the protein are important” -> agent(caused, protein)
>>
>>
>>    - aux: auxiliary
>>
>> This is a non-main verb of the clause.
>>
>> ex:
>>
>> "Reagan has died" -> aux(died, has)
>> "He should leave" -> aux(leave, should)
>>
>>
>>    - conj: conjunct
>>
>> This is the relation between two elements connected by a coordinating
>> conjunction, such as “and”, “or”, etc.
>>
>> ex:
>>
>> “Bill is big and honest” -> conj(big, honest)
>> “They either ski or snowboard” -> conj(ski, snowboard)
>>
>>
>>    - dobj: direct object
>>
>> This is the noun phrase which is the object of the verb.
>>
>> ex:
>>
>> “They win the lottery” -> dobj(win, lottery)
>>
>>
>>    - nsubj: nominal subject
>>
>> This is a noun phrase which is the syntactic subject of a clause.
>>
>> ex:
>>
>> “The baby is cute” -> nsubj(cute, baby)
>>
>> Given this library support, I would like to clarify the following.
>>
>>    1. How should we use the regular expression to extract the
>>    relationship when the library already extracts relationships itself?
>>    2. What kind of relationships should we extract? For example, is it
>>    just simple relationships, such as identifying the subject, verb and
>>    object, or others as well?
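
One possible reading of question 1, sketched under assumptions: the library produces the relation(governor, dependent) triples, and the user's regular expression only selects which relation names to emit. The Dependency and DependencyFilter classes below are hypothetical and use no Stanford NLP calls; the sample triples are hand-written for "They win the lottery".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for one typed dependency: relation(governor, dependent).
final class Dependency {
    final String relation;
    final String governor;
    final String dependent;

    Dependency(String relation, String governor, String dependent) {
        this.relation = relation;
        this.governor = governor;
        this.dependent = dependent;
    }

    @Override
    public String toString() {
        return relation + "(" + governor + ", " + dependent + ")";
    }
}

class DependencyFilter {
    // Keep only the dependencies whose relation name fully matches the regex.
    static List<Dependency> filter(List<Dependency> deps, String relationRegex) {
        Pattern pattern = Pattern.compile(relationRegex);
        List<Dependency> kept = new ArrayList<>();
        for (Dependency d : deps) {
            if (pattern.matcher(d.relation).matches()) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-written sample triples; a real parser would supply these.
        List<Dependency> deps = List.of(
                new Dependency("nsubj", "win", "They"),
                new Dependency("det", "lottery", "the"),
                new Dependency("dobj", "win", "lottery"));
        // Select only subject and direct-object relations.
        System.out.println(filter(deps, "nsubj|dobj"));
    }
}
```

Under this reading, question 2 reduces to choosing which relation names the regular expression is allowed to select.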
>>
>>
>> I look forward to your thoughts on this.
>>
>>  Thanks,
>>  Malithi.
>>
>>
>>
>> On Fri, Aug 22, 2014 at 6:11 PM, Malithi Edirisinghe <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> We started the implementation with Stanford NLP for the reasons below.
>>>
>>> 1. Stanford NLP provides rich regular expression support for writing
>>> patterns over tokens, rather than working at the character level as
>>> normal Java regular expressions do.
>>>
>>> 2. Stanford NLP can extract grammatical relationships from the parse
>>> tree, so we can easily implement the 3rd query.
>>>
>>> Thanks,
>>>
>>> Malithi.
>>>
>>>
>>> On Thu, Aug 21, 2014 at 12:58 PM, Malithi Edirisinghe <[email protected]>
>>> wrote:
>>>
>>>> Hi Suho,
>>>>
>>>> Since Named Entity Recognition is supported by both libraries, we can
>>>> implement the first function with either of them. Both can identify
>>>> entities such as person, location, organization, etc. For the fourth
>>>> function, we found that we can simply define dictionaries in OpenNLP.
>>>> There is a class called DictionaryNameFinder which takes a Dictionary and
>>>> identifies any entry in the sentence that matches the dictionary. In
>>>> Stanford NLP, we found that there is an implementation of a Dictionary,
>>>> but we have not yet found a way of using it for our requirement. It lacks
>>>> samples, and it seems we need to look into their code to see how they
>>>> have used it. We will work on it. In any case, I think it should be
>>>> possible to define such a dictionary in Stanford NLP as well.
>>>>
>>>> Thanks,
>>>> Malithi.
>>>>
>>>>
>>>> On Thu, Aug 21, 2014 at 10:09 AM, Sriskandarajah Suhothayan <
>>>> [email protected]> wrote:
>>>>
>>>>> That's a good comparison.
>>>>> Based on this, I believe we have issues in implementing functions 2 & 3
>>>>> using OpenNLP.
>>>>> Can you evaluate the other functions as well?
>>>>>
>>>>> Suho
>>>>>
>>>>>
>>>>> On Thu, Aug 21, 2014 at 9:54 AM, Chanuka Dissanayake <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We did a study on both the OpenNLP and Stanford NLP libraries and looked
>>>>>> at the features that could support our implementation.
>>>>>> Our findings are summarised below.
>>>>>>
>>>>>> It seems that Stanford NLP has better capabilities when considering
>>>>>> support for regular expressions and parsing.
>>>>>> We would like to discuss this further and choose the appropriate
>>>>>> library.
>>>>>>
>>>>>>
>>>>>> Feature: Named Entity Recognizer
>>>>>>    - OpenNLP: Identifies person, location, organization, time, date,
>>>>>>    money and percentage entities in the given sentence, but the
>>>>>>    sentence needs to be tokenized first.
>>>>>>    - StanfordNLP: Includes a 4 class model trained for CoNLL, a 7 class
>>>>>>    model trained for MUC, and a 3 class model trained on both data sets
>>>>>>    for the intersection of those class sets.
>>>>>>    3 class: Location, Person, Organization
>>>>>>    4 class: Location, Person, Organization, Misc
>>>>>>    7 class: Time, Location, Organization, Person, Money, Percent, Date
>>>>>>
>>>>>> Feature: POS Tagger
>>>>>>    - OpenNLP: Identifies tags such as VP (Verb Phrase), NP (Noun
>>>>>>    Phrase), JJ (Adjective), etc.
>>>>>>    Input: Hi. How are you? This is Mike
>>>>>>    Output: Hi_NNP How_WRB are_VBP you? _JJ This_DT is_VBZ Mike._NNP
>>>>>>    - StanfordNLP: Labels each token with its POS tag, such as noun,
>>>>>>    verb, adjective, etc.
>>>>>>
>>>>>> Feature: Tokenizing
>>>>>>    - OpenNLP: By default, separates words that have white space between
>>>>>>    them. It can also be trained to tokenize with different options.
>>>>>>    - StanfordNLP: Can tokenize the text either by whitespace or as per
>>>>>>    the options defined.
>>>>>>
>>>>>> Feature: Parsing
>>>>>>    - OpenNLP: Given a tokenized sentence, it constructs the tree
>>>>>>    structure.
>>>>>>    - StanfordNLP: Works out the grammatical structure of sentences as a
>>>>>>    tree. The parser provides Stanford Dependencies as well. They
>>>>>>    represent the grammatical relations between words in a sentence.
>>>>>>    Dependencies are triplets: name of the relation, governor and
>>>>>>    dependent.
>>>>>>    Ex: Bell, based in Los Angeles, makes and distributes electronic,
>>>>>>    computer and building products.
>>>>>>    Dependency: nsubj(distributes-10, Bell-1)
>>>>>>    This is like saying “the subject of distributes is Bell.”
>>>>>>
>>>>>> Feature: Sentence Detection
>>>>>>    - OpenNLP: Detects sentence boundaries given a paragraph.
>>>>>>    - StanfordNLP: Available as ssplit. Can split sentences as per the
>>>>>>    options defined.
>>>>>>
>>>>>> Feature: Regular Expressions
>>>>>>    - OpenNLP: Character-wise regular expressions only. Cannot identify
>>>>>>    named entities or PoS tags via regular expressions.
>>>>>>    - StanfordNLP: Two tools are provided to deal with regular
>>>>>>    expressions.
>>>>>>    RegexNER: Can define simple rules with regular expressions and label
>>>>>>    entities with NE labels that are not provided.
>>>>>>    Ex: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
>>>>>>    This rule will label tokens matching the regex in the first column
>>>>>>    as DEGREE.
>>>>>>    TokensRegex: Can identify patterns over a list of tokens. In
>>>>>>    addition to Java regex matching, this provides syntax to match part
>>>>>>    of speech tags, named entity tags and lemmas.
>>>>>>    Ex: [ { tag:VBD } ], /University/ /of/ [{ ner:LOCATION }]
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Chanuka.
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 19, 2014 at 11:11 PM, Sriskandarajah Suhothayan <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> +1 looks good
>>>>>>>
>>>>>>> Suho
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 19, 2014 at 9:56 PM, Srinath Perera <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Looks good. If possible, we should do this with OpenNLP, as it has
>>>>>>>> the Apache licence. However, I could not find an NLP regex
>>>>>>>> implementation there. Please look at it in detail.
>>>>>>>>
>>>>>>>> --Srinath
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 19, 2014 at 9:52 PM, Malithi Edirisinghe <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> We are working on an NLP Toolbox improvement in CEP. The main idea
>>>>>>>>> of this improvement is to use an NLP library and let users perform
>>>>>>>>> NLP operations as Siddhi extensions.
>>>>>>>>>
>>>>>>>>> In our implementation, we have decided to support the following NLP
>>>>>>>>> operations.
>>>>>>>>>
>>>>>>>>> *1. findNameEntityType(sentence, entityType)*
>>>>>>>>>
>>>>>>>>> *Description:*
>>>>>>>>>
>>>>>>>>> This operation takes a sentence and a predefined entity type as
>>>>>>>>> its inputs. It will return the noun(s) in the sentence that match the
>>>>>>>>> defined entity type, as event(s).
>>>>>>>>>
>>>>>>>>> *inputs:*
>>>>>>>>>
>>>>>>>>> sentence  : sentence to be processed
>>>>>>>>> entityType: predefined entity type
>>>>>>>>>   ORGANIZATION
>>>>>>>>>   NAME
>>>>>>>>>   LOCATION
>>>>>>>>>
>>>>>>>>> *output:*
>>>>>>>>>
>>>>>>>>> matching noun(s) as event(s)
>>>>>>>>>
>>>>>>>>> *example:*
>>>>>>>>>
>>>>>>>>> inputs:
>>>>>>>>> sentence   : Alice works at WSO2
>>>>>>>>> entityType : NAME
>>>>>>>>>
>>>>>>>>> output: Alice
>>>>>>>>>
>>>>>>>>> *2. findNLRegexPattern(sentence, regex)*
>>>>>>>>>
>>>>>>>>> *Description:*
>>>>>>>>>
>>>>>>>>> This operation takes a sentence and a regular expression as its
>>>>>>>>> inputs. It will return each match in the sentence, as an event.
>>>>>>>>>
>>>>>>>>> *inputs:*
>>>>>>>>>
>>>>>>>>> sentence  : sentence to be processed
>>>>>>>>> regex     : regular expression to be matched
>>>>>>>>>
>>>>>>>>> *output:*
>>>>>>>>>
>>>>>>>>> matching phrase(s) as event(s)
>>>>>>>>>
>>>>>>>>> *example:*
>>>>>>>>>
>>>>>>>>> inputs:
>>>>>>>>> sentence   : WSO2 was founded in 2005
>>>>>>>>> regex      : \\d{4}
>>>>>>>>>
>>>>>>>>> output: 2005
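
For illustration only, the example for operation 2 can be reproduced at the character level with plain java.util.regex. This is a minimal sketch, not the proposed Siddhi extension: the class and method names are hypothetical, and it cannot match on part-of-speech or named-entity tags the way a token-level pattern language could.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPatternFinder {
    // Returns every non-overlapping match of the regex in the sentence, in order.
    static List<String> findMatches(String sentence, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(sentence);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "\\d{4}" in Java source is the regex \d{4}: four consecutive digits.
        // "WSO2" contains only one digit, so it is not matched.
        System.out.println(findMatches("WSO2 was founded in 2005", "\\d{4}")); // prints [2005]
    }
}
```

In an extension, each element of the returned list would presumably be emitted as a separate event.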
>>>>>>>>>
>>>>>>>>> *3. findRelationship(sentence, regex)*
>>>>>>>>>
>>>>>>>>> *Description:*
>>>>>>>>>
>>>>>>>>> This operation takes a sentence and a regular expression as its
>>>>>>>>> inputs. For each relationship extracted using the regular expression,
>>>>>>>>> the operation will return a triplet (subject, object and relationship)
>>>>>>>>> as an event.
>>>>>>>>>
>>>>>>>>> *inputs:*
>>>>>>>>>
>>>>>>>>> sentence  : sentence to be processed
>>>>>>>>> regex     : regular expression to extract the relationship
>>>>>>>>>
>>>>>>>>> *output:*
>>>>>>>>>
>>>>>>>>> triplet(s) of (subject, object, relationship) as event(s)
>>>>>>>>>
>>>>>>>>> *example:*
>>>>>>>>>
>>>>>>>>> inputs:
>>>>>>>>> sentence   : Bob works for WSO2
>>>>>>>>> regex      : works for
>>>>>>>>>
>>>>>>>>> output: (Bob, WSO2, works for)
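
A deliberately naive sketch of the example for operation 3, assuming the text before the regex match is the subject and the text after it is the object. That assumption holds for "Bob works for WSO2" but not for sentences in general, which is where a dependency parser would come in. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveRelationshipFinder {
    // Returns {subject, object, relationship} for the first regex match,
    // or an empty Optional when the regex does not occur in the sentence.
    static Optional<String[]> find(String sentence, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(sentence);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Assumption: everything before the match is the subject,
        // everything after it is the object.
        String subject = sentence.substring(0, matcher.start()).trim();
        String object = sentence.substring(matcher.end()).trim();
        return Optional.of(new String[] {subject, object, matcher.group()});
    }

    public static void main(String[] args) {
        find("Bob works for WSO2", "works for")
                .ifPresent(t -> System.out.println("(" + String.join(", ", t) + ")"));
        // prints (Bob, WSO2, works for)
    }
}
```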
>>>>>>>>>
>>>>>>>>> *4. findNameEntityTypeViaDictionary(sentence, dictionary, entityType)*
>>>>>>>>>
>>>>>>>>> *Description:*
>>>>>>>>>
>>>>>>>>> This operation takes a sentence, a dictionary file and a predefined
>>>>>>>>> entity type as its inputs. It will return the noun(s) in the sentence
>>>>>>>>> that are of the defined entity type and also exist in the dictionary,
>>>>>>>>> as event(s).
>>>>>>>>>
>>>>>>>>> *inputs:*
>>>>>>>>>
>>>>>>>>> sentence   : sentence to be processed
>>>>>>>>> dictionary : dictionary of entities of the defined entity type
>>>>>>>>> entityType : predefined entity type
>>>>>>>>>   ORGANIZATION
>>>>>>>>>   NAME
>>>>>>>>>   LOCATION
>>>>>>>>>
>>>>>>>>> *output:*
>>>>>>>>>
>>>>>>>>> matching noun(s) as event(s)
>>>>>>>>>
>>>>>>>>> *example:*
>>>>>>>>>
>>>>>>>>> inputs:
>>>>>>>>> sentence    : Bob works at WSO2
>>>>>>>>> dictionary  : (WSO2,ORACLE,IBM)
>>>>>>>>> entityType  : ORGANIZATION
>>>>>>>>>
>>>>>>>>> output: WSO2
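
The dictionary lookup in operation 4 can be sketched without any NLP library, assuming whitespace tokenization and exact, case-sensitive matching. The class and method names are hypothetical, the entityType argument is left out because the dictionary is assumed to hold entities of a single type, and punctuation attached to a word would prevent a match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class DictionaryEntityFinder {
    // Returns the whitespace-separated tokens of the sentence that appear
    // in the dictionary, in sentence order.
    static List<String> find(String sentence, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        for (String token : sentence.split("\\s+")) {
            if (dictionary.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> organizations = Set.of("WSO2", "ORACLE", "IBM");
        System.out.println(find("Bob works at WSO2", organizations)); // prints [WSO2]
    }
}
```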
>>>>>>>>>
>>>>>>>>> Each NLP operation defined here will be implemented as a
>>>>>>>>> transformer extension to Siddhi.
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> *Malithi Edirisinghe*
>>>>>>>>> Senior Software Engineer
>>>>>>>>> WSO2 Inc.
>>>>>>>>>
>>>>>>>>> Mobile : +94 (0) 718176807
>>>>>>>>>  [email protected]
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Director, Research, WSO2 Inc.
>>>>>>>> Visiting Faculty, University of Moratuwa
>>>>>>>> Member, Apache Software Foundation
>>>>>>>> Research Scientist, Lanka Software Foundation
>>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>>> Phone: 0772360902
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> *S. Suhothayan*
>>>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>>>>  *WSO2 Inc. *http://wso2.com
>>>>>>> * <http://wso2.com/>*
>>>>>>> lean . enterprise . middleware
>>>>>>>
>>>>>>>
>>>>>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog:
>>>>>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/> 
>>>>>>> twitter:
>>>>>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | 
>>>>>>> linked-in:
>>>>>>> http://lk.linkedin.com/in/suhothayan 
>>>>>>> <http://lk.linkedin.com/in/suhothayan>*
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chanuka Dissanayake
>>>>>> *Software Engineer | **WSO2 Inc.*; http://wso2.com
>>>>>>
>>>>>> Mobile: +94 71 33 63 596
>>>>>> Email: [email protected]
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>



-- 
============================
Director, Research, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Member, Apache Software Foundation
Research Scientist, Lanka Software Foundation
Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
Site: http://people.apache.org/~hemapani/
Photos: http://www.flickr.com/photos/hemapani/
Phone: 0772360902

_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
