Re: [Architecture] [CEP] NLP Toolbox

Srinath Perera Mon, 01 Sep 2014 02:12:12 -0700

Can we meet and discuss? How about tomorrow 11am?


On Thu, Aug 28, 2014 at 6:49 PM, Malithi Edirisinghe <[email protected]>
wrote:

> Hi,
>
> I have looked in detail at how Stanford NLP extracts grammatical
> dependencies and have the following concerns regarding the implementation
> of the 3rd query, findRelationship(sentence, regex).
>
> Given a sentence, Stanford NLP can recognise around 50 grammatical
> relationships. I have listed some of them with simple examples below.
>
>
>    - acomp: adjectival complement
>
> This is an adjectival phrase which functions as the complement (like an
> object of the verb).
>
> ex:
>
> “She looks very beautiful” -> acomp(looks, beautiful)
>
>
>    - agent
>
> This is a complement of a passive verb which is introduced by the
> preposition “by” and does the action.
>
> ex:
>
> “The man has been killed by the police” -> agent(killed, police)
> “Effects caused by the protein are important” -> agent(caused, protein)
>
>
>    - aux: auxiliary
>
> This is the non-main verb of the clause.
>
> ex:
>
> "Reagan has died" -> aux(died, has)
> "He should leave" -> aux(leave, should)
>
>
>    - conj: conjunct
>
> This is the relation between two elements connected by a coordinating
> conjunction, such as “and”, “or”, etc.
>
> ex:
>
> “Bill is big and honest” -> conj(big, honest)
> “They either ski or snowboard” -> conj(ski, snowboard)
>
>
>    - dobj: direct object
>
>  This is the noun phrase which is the object of the verb.
>
>  ex:
>
>  “They win the lottery” -> dobj(win, lottery)
>
>
>    - nsubj: nominal subject
>
>  This is a noun phrase which is the syntactic subject of a clause.
>
>  ex:
>  “The baby is cute” -> nsubj(cute, baby)
>
> Given this library support, I would like to clarify the following.
>
>    1. How should we use the regular expression to extract the
>    relationship when the library already extracts relationships itself?
>    2. What kinds of relationships should we extract? For example, only
>    simple ones such as identifying the subject, verb and object, or
>    others as well?
>
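One possible reading of the two questions above, sketched in Python rather than against the Stanford API: the library supplies (relation, governor, dependent) triples, and the regular expression only selects which governing verbs are of interest. The triples below are hand-written in the notation of the examples above and are not parser output; `find_relationships` is a hypothetical name, and joining `nsubj` with `dobj` is just one simple choice of relationship.

```python
import re
from typing import List, Tuple

# (relation, governor, dependent), e.g. ("dobj", "win", "lottery")
Dependency = Tuple[str, str, str]


def find_relationships(deps: List[Dependency], verb_regex: str) -> List[Tuple[str, str, str]]:
    """Join nsubj and dobj triples that share a governor matching verb_regex.

    Returns (subject, object, relationship) triplets.
    """
    pattern = re.compile(verb_regex)
    # Map each governor to its nominal subject, if it has one.
    subjects = {gov: dep for rel, gov, dep in deps if rel == "nsubj"}
    results = []
    for rel, gov, dep in deps:
        if rel == "dobj" and gov in subjects and pattern.fullmatch(gov):
            results.append((subjects[gov], dep, gov))
    return results


# Hand-written triples for "They win the lottery" (assumed, not parser output).
deps = [("nsubj", "win", "They"), ("dobj", "win", "lottery"), ("det", "lottery", "the")]
print(find_relationships(deps, r"win|won"))  # prints [('They', 'lottery', 'win')]
```

With this split, the regular expression never has to encode grammar; it only filters the relations the parser has already found.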
>
> Your thoughts on this would be appreciated.
>
>  Thanks,
>  Malithi.
>
>
>
> On Fri, Aug 22, 2014 at 6:11 PM, Malithi Edirisinghe <[email protected]>
> wrote:
>
>> Hi,
>>
>> We started the implementation with Stanford NLP for the reasons below.
>>
>> 1. Stanford NLP provides rich regular expression support for writing
>> patterns over tokens, rather than working at the character level as
>> normal Java regular expressions do.
>>
>> 2. Stanford NLP can extract grammatical relationships from the parse
>> tree, so we can easily implement the 3rd query.
>>
>> Thanks,
>>
>> Malithi.
>>
>>
>> On Thu, Aug 21, 2014 at 12:58 PM, Malithi Edirisinghe <[email protected]>
>> wrote:
>>
>>> Hi Suho,
>>>
>>> Since Named Entity Recognition is supported by both libraries, we can
>>> implement the first function with either of them. Both can identify
>>> entities like person, location, organization, etc. For the fourth
>>> function, we found that we can simply define dictionaries in OpenNLP.
>>> There is a class called DictionaryNameFinder which takes a Dictionary and
>>> identifies any entry of the dictionary that occurs in the sentence. In
>>> Stanford NLP, we found that there is an implementation of a Dictionary,
>>> but we have not yet found a way of using it for our requirement. It lacks
>>> samples, and it seems we will have to look into their code to see how they
>>> have used it. We will work on it. In any case, I think it should be
>>> possible to define such a Dictionary in Stanford NLP as well.
>>>
>>> Thanks,
>>> Malithi.
>>>
>>>
>>> On Thu, Aug 21, 2014 at 10:09 AM, Sriskandarajah Suhothayan <
>>> [email protected]> wrote:
>>>
>>>> That's a good comparison.
>>>> Based on this, I believe we have issues in implementing functions 2 & 3
>>>> using OpenNLP.
>>>> Can you evaluate the other functions as well?
>>>>
>>>> Suho
>>>>
>>>>
>>>> On Thu, Aug 21, 2014 at 9:54 AM, Chanuka Dissanayake <[email protected]>
>>>> wrote:
>>>>
>>>>> We did a study on both OpenNLP and Stanford NLP libraries and looked
>>>>> at the features that could support our implementation.
>>>>> Our findings are summarised below.
>>>>>
>>>>> It seems that Stanford NLP has better capabilities when it comes to
>>>>> support for regular expressions and parsing.
>>>>> We would like to discuss this further and choose the appropriate
>>>>> library.
>>>>>
>>>>>
>>>>> Feature: Named Entity Recognizer
>>>>>   OpenNLP: Will identify the person, location, organization, time,
>>>>>   date, money and percentage inside the given sentence, but the sentence
>>>>>   needs to be tokenized first.
>>>>>   StanfordNLP: Includes a 4 class model trained for CoNLL, a 7 class
>>>>>   model trained for MUC, and a 3 class model trained on both data sets
>>>>>   for the intersection of those class sets.
>>>>>     3 class: Location, Person, Organization
>>>>>     4 class: Location, Person, Organization, Misc
>>>>>     7 class: Time, Location, Organization, Person, Money, Percent, Date
>>>>>
>>>>> Feature: POS Tagger
>>>>>   OpenNLP: Identifies VP (Verb Phrase), NP (Noun Phrase), JJ
>>>>>   (Adjective), etc.
>>>>>     Input: Hi. How are you? This is Mike
>>>>>     Output: Hi_NNP How_WRB are_VBP you? _JJ This_DT is_VBZ Mike._NNP
>>>>>   StanfordNLP: Labels each token with its POS tag, such as noun, verb,
>>>>>   adjective, etc.
>>>>>
>>>>> Feature: Tokenizing
>>>>>   OpenNLP: Separates words on the white space between them by default.
>>>>>   Otherwise it can be trained to tokenize with different options.
>>>>>   StanfordNLP: Can tokenize the text either by whitespace or as per the
>>>>>   options defined.
>>>>>
>>>>> Feature: Parsing
>>>>>   OpenNLP: Given a tokenized sentence, it will construct the tree
>>>>>   structure.
>>>>>   StanfordNLP: Works out the grammatical structure of sentences as a
>>>>>   tree. The parser provides Stanford Dependencies as well. They
>>>>>   represent the grammatical relations between words in a sentence.
>>>>>   Dependencies are triplets: name of the relation, governor and
>>>>>   dependent.
>>>>>     Ex: Bell, based in Los Angeles, makes and distributes electronic,
>>>>>     computer and building products.
>>>>>     Dependency: nsubj(distributes-10, Bell-1)
>>>>>     This is like saying “the subject of distributes is Bell.”
>>>>>
>>>>> Feature: Sentence Detection
>>>>>   OpenNLP: Detects sentence boundaries given a paragraph.
>>>>>   StanfordNLP: Available as ssplit. Can split sentences as per the
>>>>>   options defined.
>>>>>
>>>>> Feature: Regular Expressions
>>>>>   OpenNLP: Character-wise regular expressions only. Cannot identify
>>>>>   named entities or PoS tags via regular expressions.
>>>>>   StanfordNLP: Two tools are provided to deal with regular expressions.
>>>>>     RegexNER: Can define simple rules with regular expressions and label
>>>>>     entities with NE labels that are not provided.
>>>>>       Ex: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
>>>>>       This rule will label tokens matching the regex in the first
>>>>>       column as DEGREE.
>>>>>     TokensRegex: Can identify patterns over a list of tokens. In
>>>>>     addition to Java regex matching, this provides syntax to match part
>>>>>     of speech tags, named entity tags and lemmas.
>>>>>       Ex: [ { tag:VBD } ], /University/ /of/ [{ ner:LOCATION }]
>>>>>
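To illustrate the difference between character-level and token-level matching described in the comparison above, here is a minimal Python sketch. It does not use TokensRegex itself: the tokens are hand-annotated dictionaries (the tags and NER labels are assumed for illustration, not tagger output), and `word`, `ner` and `match_tokens` are hypothetical helpers that only support fixed-length sequences of per-token tests.

```python
import re
from typing import Callable, Dict, List

Token = Dict[str, str]               # keys: "word", "tag", "ner" (hand-annotated here)
Predicate = Callable[[Token], bool]  # a test applied to a single token


def word(regex: str) -> Predicate:
    """Accept a token whose surface form fully matches regex."""
    return lambda t: re.fullmatch(regex, t["word"]) is not None


def ner(label: str) -> Predicate:
    """Accept a token carrying the given named-entity label."""
    return lambda t: t["ner"] == label


def match_tokens(tokens: List[Token], pattern: List[Predicate]) -> List[str]:
    """Return every contiguous span where pattern[i] accepts the i-th token of the span."""
    n = len(pattern)
    hits = []
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(p(t) for p, t in zip(pattern, window)):
            hits.append(" ".join(t["word"] for t in window))
    return hits


# Assumed annotations; a real tagger may label these tokens differently.
tokens = [
    {"word": "She", "tag": "PRP", "ner": "O"},
    {"word": "attended", "tag": "VBD", "ner": "O"},
    {"word": "University", "tag": "NNP", "ner": "ORGANIZATION"},
    {"word": "of", "tag": "IN", "ner": "ORGANIZATION"},
    {"word": "Moratuwa", "tag": "NNP", "ner": "LOCATION"},
]
pattern = [word("University"), word("of"), ner("LOCATION")]
print(match_tokens(tokens, pattern))  # prints ['University of Moratuwa']
```

A character-level regular expression could match the literal words "University of", but it has no way to require that the next token is a LOCATION.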
>>>>>
>>>>> Thanks,
>>>>> Chanuka.
>>>>>
>>>>>
>>>>> On Tue, Aug 19, 2014 at 11:11 PM, Sriskandarajah Suhothayan <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> +1 looks good
>>>>>>
>>>>>> Suho
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 19, 2014 at 9:56 PM, Srinath Perera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Looks good. If possible we should do this with OpenNLP, as it has the
>>>>>>> Apache licence. However, I could not find an NLP regex implementation
>>>>>>> there. Please look at it in detail.
>>>>>>>
>>>>>>> --Srinath
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 19, 2014 at 9:52 PM, Malithi Edirisinghe <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We are working on an NLP Toolbox improvement in CEP. The main idea
>>>>>>>> of this improvement is to use an NLP library and let users perform
>>>>>>>> NLP operations as Siddhi extensions.
>>>>>>>>
>>>>>>>> In our implementation we have decided to support the following NLP
>>>>>>>> operations.
>>>>>>>>
>>>>>>>> *1. findNameEntityType(sentence, entityType)*
>>>>>>>>
>>>>>>>> *Description:*
>>>>>>>>
>>>>>>>> This operation takes a sentence and a predefined entity type as
>>>>>>>> its inputs. It will return the noun(s) in the sentence that match the
>>>>>>>> defined entity type, as event(s).
>>>>>>>>
>>>>>>>> *inputs:*
>>>>>>>>
>>>>>>>> sentence  : sentence to be processed
>>>>>>>> entityType: predefined entity type
>>>>>>>>   ORGANIZATION
>>>>>>>>   NAME
>>>>>>>>   LOCATION
>>>>>>>>
>>>>>>>> *output:*
>>>>>>>>
>>>>>>>> matching noun(s) as event(s)
>>>>>>>>
>>>>>>>> *example:*
>>>>>>>>
>>>>>>>> inputs:
>>>>>>>>   sentence   : Alice works at WSO2
>>>>>>>>   entityType : NAME
>>>>>>>>
>>>>>>>> output: Alice
>>>>>>>>
>>>>>>>> *2. findNLRegexPattern(sentence, regex)*
>>>>>>>>
>>>>>>>> *Description:*
>>>>>>>>
>>>>>>>> This operation takes a sentence and a regular expression as its
>>>>>>>> inputs. It will return each match in the sentence, as an event.
>>>>>>>>
>>>>>>>> *inputs:*
>>>>>>>>
>>>>>>>> sentence  : sentence to be processed
>>>>>>>> regex     : regular expression to be matched
>>>>>>>>
>>>>>>>> *output:*
>>>>>>>>
>>>>>>>> matching phrase(s) as event(s)
>>>>>>>>
>>>>>>>> *example:*
>>>>>>>>
>>>>>>>> inputs:
>>>>>>>>   sentence   : WSO2 was found in 2005
>>>>>>>>   regex      : \\d{4}
>>>>>>>>
>>>>>>>> output: 2005
>>>>>>>>
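The example for operation 2 can be reproduced with an ordinary character-level regular expression. The sketch below is in Python rather than Java, and `find_nl_regex_pattern` is a hypothetical stand-in for the proposed operation that returns a list instead of emitting events. Note that `\\d{4}` in the mail is the Java string-literal form of the regex `\d{4}`.

```python
import re
from typing import List


def find_nl_regex_pattern(sentence: str, regex: str) -> List[str]:
    """Return each non-overlapping match of regex in sentence, in order."""
    return [m.group(0) for m in re.finditer(regex, sentence)]


# "WSO2" contains only one digit, so the single four-digit match is the year.
print(find_nl_regex_pattern("WSO2 was found in 2005", r"\d{4}"))  # prints ['2005']
```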
>>>>>>>> *3. findRelationship(sentence, regex)*
>>>>>>>>
>>>>>>>> *Description:*
>>>>>>>>
>>>>>>>> This operation takes a sentence and a regular expression as its
>>>>>>>> inputs. For each relationship extracted using the regular expression,
>>>>>>>> the operation will return a triplet (subject, object, relationship)
>>>>>>>> as an event.
>>>>>>>>
>>>>>>>> *inputs:*
>>>>>>>>
>>>>>>>> sentence  : sentence to be processed
>>>>>>>> regex     : regular expression to extract the relationship
>>>>>>>>
>>>>>>>> *output:*
>>>>>>>>
>>>>>>>> triplet(s) of (subject, object, relationship) as event(s)
>>>>>>>>
>>>>>>>> *example:*
>>>>>>>>
>>>>>>>> inputs:
>>>>>>>>   sentence   : Bob works for WSO2
>>>>>>>>   regex      : works for
>>>>>>>>
>>>>>>>>  output: (Bob, WSO2, works for)
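
The example for operation 3 can be reproduced with a deliberately naive character-level sketch, shown here in Python. It assumes the text before the match is the subject and the text after it is the object, which holds only for simple "X relation Y" sentences; `find_relationship` is a hypothetical stand-in for the proposed operation, not an implementation of it.

```python
import re
from typing import List, Tuple


def find_relationship(sentence: str, regex: str) -> List[Tuple[str, str, str]]:
    """Return (subject, object, relationship) for each match of regex.

    Naive assumption: everything before the match is the subject and
    everything after it is the object.
    """
    triplets = []
    for m in re.finditer(regex, sentence):
        subject = sentence[:m.start()].strip()
        obj = sentence[m.end():].strip()
        if subject and obj:
            triplets.append((subject, obj, m.group(0)))
    return triplets


print(find_relationship("Bob works for WSO2", r"works for"))
# prints [('Bob', 'WSO2', 'works for')]
```

For longer sentences the subject and object are no longer simply the surrounding text, which is where parser-provided grammatical relations would be needed.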
>>>>>>>>
>>>>>>>> *4. findNameEntityTypeViaDictionary(sentence, dictionary,
>>>>>>>> entityType)*
>>>>>>>>
>>>>>>>> *Description:*
>>>>>>>>
>>>>>>>> This operation takes a sentence, a dictionary file and a predefined
>>>>>>>> entity type as its inputs. It will return the noun(s) in the sentence
>>>>>>>> of the defined entity type that also exist in the dictionary, as
>>>>>>>> event(s).
>>>>>>>>
>>>>>>>> *inputs:*
>>>>>>>>
>>>>>>>> sentence   : sentence to be processed
>>>>>>>> dictionary : dictionary of entities of the defined entity type
>>>>>>>> entityType : predefined entity type
>>>>>>>>   ORGANIZATION
>>>>>>>>   NAME
>>>>>>>>   LOCATION
>>>>>>>>
>>>>>>>> *output:*
>>>>>>>>
>>>>>>>> matching noun(s) as event(s)
>>>>>>>>
>>>>>>>> *example:*
>>>>>>>>
>>>>>>>> inputs:
>>>>>>>>   sentence   : Bob works at WSO2
>>>>>>>>   dictionary : (WSO2,ORACLE,IBM)
>>>>>>>>   entityType : ORGANIZATION
>>>>>>>>
>>>>>>>> output: WSO2
>>>>>>>>
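The dictionary lookup in operation 4 can be sketched in Python as follows. This is an illustration under stated assumptions and uses neither OpenNLP nor Stanford NLP: the dictionary is modelled as a mapping from entity type to a set of single-token entries, the sentence is split on whitespace, and `find_name_entity_type_via_dictionary` is a hypothetical name that mirrors the proposed signature.

```python
from typing import Dict, List, Set


def find_name_entity_type_via_dictionary(
    sentence: str, dictionary: Dict[str, Set[str]], entity_type: str
) -> List[str]:
    """Return the tokens of sentence listed in the dictionary under entity_type."""
    entries = dictionary.get(entity_type, set())
    # Whitespace tokenization only; multi-word entries are not handled here.
    return [token for token in sentence.split() if token in entries]


dictionary = {"ORGANIZATION": {"WSO2", "ORACLE", "IBM"}}
print(find_name_entity_type_via_dictionary("Bob works at WSO2", dictionary, "ORGANIZATION"))
# prints ['WSO2']
```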
>>>>>>>> Each NLP operation defined here will be implemented as a
>>>>>>>> transformer extension to Siddhi.
>>>>>>>> --
>>>>>>>>
>>>>>>>> *Malithi Edirisinghe*
>>>>>>>> Senior Software Engineer
>>>>>>>> WSO2 Inc.
>>>>>>>>
>>>>>>>> Mobile : +94 (0) 718176807
>>>>>>>>  [email protected]
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ============================
>>>>>>> Director, Research, WSO2 Inc.
>>>>>>> Visiting Faculty, University of Moratuwa
>>>>>>> Member, Apache Software Foundation
>>>>>>> Research Scientist, Lanka Software Foundation
>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>> Phone: 0772360902
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> *S. Suhothayan*
>>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>>> *WSO2 Inc.* http://wso2.com
>>>>>> lean . enterprise . middleware
>>>>>>
>>>>>> cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
>>>>>> twitter: http://twitter.com/suhothayan
>>>>>> linked-in: http://lk.linkedin.com/in/suhothayan
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chanuka Dissanayake
>>>>> *Software Engineer | **WSO2 Inc.*; http://wso2.com
>>>>>
>>>>> Mobile: +94 71 33 63 596
>>>>> Email: [email protected]
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>




_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
