Re: [Architecture] [CEP] NLP Toolbox

Malithi Edirisinghe Thu, 28 Aug 2014 06:19:58 -0700

Hi,

I have looked in detail at how Stanford NLP extracts grammatical dependencies
and have the following concerns with regard to the implementation of the 3rd
query (findRelationship(sentence, regex)).


Given a sentence, Stanford NLP can recognise around 50 grammatical
relationships. I have listed some with simple examples below.


   - acomp: adjectival complement

This is an adjectival phrase which functions as the complement of a verb
(like an object of the verb).

ex:

“She looks very beautiful” -> acomp(looks, beautiful)


   - agent

This is a complement of a passive verb which is introduced by the
preposition “by” and does the action.

ex:

“The man has been killed by the police” -> agent(killed, police)
“Effects caused by the protein are important” -> agent(caused, protein)


   - aux: auxiliary

This is the non-main verb of the clause.

ex:

"Reagan has died" -> aux(died, has)
"He should leave" -> aux(leave, should)


   - conj: conjunct

This is the relation between two elements connected by a coordinating
conjunction, such as “and”, “or”, etc.

ex:

“Bill is big and honest” -> conj(big, honest)
“They either ski or snowboard” -> conj(ski, snowboard)


   - dobj: direct object

This is the noun phrase which is the object of the verb.

ex:

“They win the lottery” -> dobj(win, lottery)


   - nsubj: nominal subject

This is a noun phrase which is the syntactic subject of a clause.

ex:

“The baby is cute” -> nsubj(cute, baby)
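
To make the notation above concrete, here is a minimal Python sketch that
parses the relation(governor, dependent) strings into triples. It only
illustrates the notation: it does not call Stanford NLP, and the names
Dependency and parse_dependency are hypothetical helpers introduced here.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    """One grammatical relation: relation name, governor word, dependent word."""
    relation: str
    governor: str
    dependent: str


# Matches text such as "nsubj(cute, baby)" or "aux(leave,should)".
_DEP = re.compile(r"^\s*(\w+)\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)\s*$")


def parse_dependency(text: str) -> Dependency:
    """Parse 'relation(governor, dependent)' into a Dependency triple."""
    match = _DEP.match(text)
    if match is None:
        raise ValueError(f"not a dependency string: {text!r}")
    return Dependency(*match.groups())


# Examples copied from the list above (hard-coded, not library output).
examples = [
    "acomp(looks, beautiful)",
    "agent(killed, police)",
    "aux(leave,should)",
    "nsubj(cute, baby)",
]
parsed = [parse_dependency(e) for e in examples]
for dep in parsed:
    print(dep.relation, dep.governor, dep.dependent)
```

Keeping the relation name separate from the two words is what would let a
later step filter on the relation, on the words, or on both.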

Given this library support, I would like to clarify the following.

   1. How should we use the regular expression to extract the relationship
   when the library already extracts relationships itself?
   2. What kind of relationships should we extract? For example, is it just
   simple relationships such as identifying the subject, verb and object, or
   others as well?
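
One possible reading of these two questions, sketched in Python: let the
library produce the dependency triples, and use the regular expression only
to select which verbs we report. The triples below are hard-coded for the
sentence "They win the lottery"; dobj(win, lottery) is from the example
above, while nsubj(win, They) and det(lottery, the) are my assumptions about
what the parser would return. The function name find_relationship is
hypothetical and this is not the Siddhi extension itself.

```python
import re

# (relation, governor, dependent) triples for "They win the lottery".
# Only dobj(win, lottery) comes from the examples above; the other two
# triples are assumed for illustration.
TRIPLES = [
    ("nsubj", "win", "They"),
    ("det", "lottery", "the"),
    ("dobj", "win", "lottery"),
]


def find_relationship(triples, verb_regex):
    """Return (subject, object, verb) for each verb that fully matches verb_regex.

    A subject is the dependent of an nsubj relation and an object is the
    dependent of a dobj relation; they are paired when they share a governor.
    """
    pattern = re.compile(verb_regex)
    subjects, objects = {}, {}
    for relation, governor, dependent in triples:
        if relation == "nsubj":
            subjects.setdefault(governor, []).append(dependent)
        elif relation == "dobj":
            objects.setdefault(governor, []).append(dependent)

    results = []
    for verb, verb_subjects in subjects.items():
        if not pattern.fullmatch(verb):
            continue
        for subject in verb_subjects:
            for obj in objects.get(verb, []):
                results.append((subject, obj, verb))
    return results


print(find_relationship(TRIPLES, r"win|won"))
```

This only covers the simple subject, verb, object case from question 2. A
sentence such as "Bob works for WSO2" has no direct object, so it would need
other relations that this sketch does not handle.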


I would appreciate your thoughts on this.

 Thanks,
 Malithi.



On Fri, Aug 22, 2014 at 6:11 PM, Malithi Edirisinghe <[email protected]>
wrote:

> Hi,
>
> We started the implementation with Stanford NLP for the reasons below.
>
> 1. Stanford NLP provides rich regular expression support for writing
> patterns over tokens, rather than working at character level as with normal
> Java regular expressions.
>
> 2. Stanford NLP can extract grammatical relationships from the parse tree,
> so we can easily implement the 3rd query.
>
> Thanks,
>
> Malithi.
>
>
> On Thu, Aug 21, 2014 at 12:58 PM, Malithi Edirisinghe <[email protected]>
> wrote:
>
>> Hi Suho,
>>
>> Since Named Entity Recognition is supported by both libraries, we can
>> implement the first function with either of them. Both can identify entities
>> like person, location, organization, etc. For the fourth function, we found
>> that we can simply define dictionaries in OpenNLP. There is a class called
>> DictionaryNameFinder which takes a Dictionary and identifies any entry in
>> the sentence that matches the dictionary. In Stanford NLP, we found that
>> there is an implementation of a Dictionary, but we have not yet found a way
>> of using it for our requirement. It lacks samples, and it seems we will have
>> to look into their code to find how they have used it. We will work on it.
>> In any case, I think it should be possible to define such a Dictionary in
>> Stanford NLP as well.
>>
>> Thanks,
>> Malithi.
>>
>>
>> On Thu, Aug 21, 2014 at 10:09 AM, Sriskandarajah Suhothayan <
>> [email protected]> wrote:
>>
>>> That's a good comparison.
>>> Based on this, I believe we have issues in implementing functions 2 & 3
>>> using OpenNLP.
>>> Can you evaluate the other functions as well?
>>>
>>> Suho
>>>
>>>
>>> On Thu, Aug 21, 2014 at 9:54 AM, Chanuka Dissanayake <[email protected]>
>>> wrote:
>>>
>>>> We did a study on both OpenNLP and Stanford NLP libraries and looked at
>>>> the features that could support our implementation.
>>>> Our findings are summarised below.
>>>>
>>>> It seems that Stanford NLP has better capabilities when considering
>>>> support for regular expressions and parsing.
>>>> We would like to discuss this further and choose the appropriate library.
>>>>
>>>>
>>>> Feature: Named Entity Recognizer
>>>>   OpenNLP: Identifies person, location, organization, time, date, money
>>>>     and percentage entities in the given sentence, but the sentence needs
>>>>     to be tokenized first.
>>>>   StanfordNLP: Includes a 4 class model trained for CoNLL, a 7 class model
>>>>     trained for MUC, and a 3 class model trained on both data sets for the
>>>>     intersection of those class sets.
>>>>     3 class: Location, Person, Organization
>>>>     4 class: Location, Person, Organization, Misc
>>>>     7 class: Time, Location, Organization, Person, Money, Percent, Date
>>>>
>>>> Feature: POS Tagger
>>>>   OpenNLP: Identifies VP (Verb Phrase), NP (Noun Phrase), JJ (Adjective),
>>>>     etc.
>>>>     Input: Hi. How are you? This is Mike
>>>>     Output: Hi_NNP How_WRB are_VBP you? _JJ This_DT is_VBZ Mike._NNP
>>>>   StanfordNLP: Labels each token with its POS tag, such as noun, verb,
>>>>     adjective, etc.
>>>>
>>>> Feature: Tokenizing
>>>>   OpenNLP: By default, separates words at the white space between them.
>>>>     It can also be trained to tokenize by different options.
>>>>   StanfordNLP: Can tokenize the text either by whitespace or as per the
>>>>     options defined.
>>>>
>>>> Feature: Parsing
>>>>   OpenNLP: Given a tokenized sentence, it constructs the tree structure.
>>>>   StanfordNLP: Works out the grammatical structure of sentences as a tree.
>>>>     The parser provides Stanford Dependencies as well. They represent the
>>>>     grammatical relations between words in a sentence. Dependencies are
>>>>     triplets: name of the relation, governor and dependent.
>>>>     Ex: Bell, based in Los Angeles, makes and distributes electronic,
>>>>     computer and building products.
>>>>     Dependency: nsubj(distributes-10, Bell-1)
>>>>     This is like saying “the subject of distributes is Bell.”
>>>>
>>>> Feature: Sentence Detection
>>>>   OpenNLP: Detects sentence boundaries given a paragraph.
>>>>   StanfordNLP: Available as ssplit. Can split sentences as per the options
>>>>     defined.
>>>>
>>>> Feature: Regular Expressions
>>>>   OpenNLP: Character-wise regular expressions only. Cannot identify named
>>>>     entities or PoS tags via regular expressions.
>>>>   StanfordNLP: Two tools are provided to deal with regular expressions.
>>>>     RegexNER: Can define simple rules with regular expressions and label
>>>>     entities with NE labels that are not provided.
>>>>     Ex: Bachelor of (Arts|Laws|Science|Engineering) DEGREE
>>>>     This rule will label tokens matching the regex in the first column as
>>>>     DEGREE.
>>>>     TokensRegex: Can identify patterns over a list of tokens. In addition
>>>>     to Java regex matching, this provides syntax to match part of speech
>>>>     tags, named entity tags and lemmas.
>>>>     Ex: [ { tag:VBD } ], /University/ /of/ [{ ner:LOCATION }]
>>>>
>>>>
>>>> Thanks,
>>>> Chanuka.
>>>>
>>>>
>>>> On Tue, Aug 19, 2014 at 11:11 PM, Sriskandarajah Suhothayan <
>>>> [email protected]> wrote:
>>>>
>>>>> +1 looks good
>>>>>
>>>>> Suho
>>>>>
>>>>>
>>>>> On Tue, Aug 19, 2014 at 9:56 PM, Srinath Perera <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Looks good. If possible we should do this with OpenNLP as it has the
>>>>>> Apache licence. However, I could not find an NLP regex implementation
>>>>>> there. Please look at it in detail.
>>>>>>
>>>>>> --Srinath
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 19, 2014 at 9:52 PM, Malithi Edirisinghe <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We are working on an NLP Toolbox improvement in CEP. The main idea of
>>>>>>> this improvement is to use an NLP library and let users perform some
>>>>>>> NLP operations as Siddhi extensions.
>>>>>>>
>>>>>>> So in our implementation we have decided to support the following NLP
>>>>>>> operations.
>>>>>>>
>>>>>>> *1. findNameEntityType(sentence, entityType)*
>>>>>>>
>>>>>>> *Description:*
>>>>>>>
>>>>>>> This operation takes a sentence and a predefined entity type as its
>>>>>>> inputs. It will return noun(s) in the sentence that match the defined
>>>>>>> entity type, as event(s).
>>>>>>>
>>>>>>> *inputs:*
>>>>>>>
>>>>>>> sentence   : sentence to be processed
>>>>>>> entityType : predefined entity type
>>>>>>>   ORGANIZATION
>>>>>>>   NAME
>>>>>>>   LOCATION
>>>>>>>
>>>>>>> *output:*
>>>>>>>
>>>>>>> matching noun(s) as event(s)
>>>>>>>
>>>>>>> *example:*
>>>>>>>
>>>>>>> inputs:
>>>>>>>   sentence   : Alice works at WSO2
>>>>>>>   entityType : NAME
>>>>>>>
>>>>>>> output: Alice
>>>>>>>
>>>>>>> *2. findNLRegexPattern(sentence, regex)*
>>>>>>>
>>>>>>> *Description:*
>>>>>>>
>>>>>>> This operation takes a sentence and a regular expression as its
>>>>>>> inputs. It will return each match in the sentence, as an event.
>>>>>>>
>>>>>>> *inputs:*
>>>>>>>
>>>>>>> sentence : sentence to be processed
>>>>>>> regex    : regular expression to be matched
>>>>>>>
>>>>>>> *output:*
>>>>>>>
>>>>>>> matching phrase(s) as event(s)
>>>>>>>
>>>>>>> *example:*
>>>>>>>
>>>>>>> inputs:
>>>>>>>   sentence : WSO2 was founded in 2005
>>>>>>>   regex    : \\d{4}
>>>>>>>
>>>>>>> output: 2005
>>>>>>>
>>>>>>> *3. findRelationship(sentence, regex)*
>>>>>>>
>>>>>>> *Description:*
>>>>>>>
>>>>>>> This operation takes a sentence and a regular expression as its
>>>>>>> inputs. For each relationship extracted using the regular expression,
>>>>>>> the operation will return a triplet (subject, object, relationship) as
>>>>>>> an event.
>>>>>>>
>>>>>>> *inputs:*
>>>>>>>
>>>>>>> sentence : sentence to be processed
>>>>>>> regex    : regular expression to extract the relationship
>>>>>>>
>>>>>>> *output:*
>>>>>>>
>>>>>>> triplet(s) of (subject, object, relationship) as event(s)
>>>>>>>
>>>>>>> *example:*
>>>>>>>
>>>>>>> inputs:
>>>>>>>   sentence : Bob works for WSO2
>>>>>>>   regex    : works for
>>>>>>>
>>>>>>> output: (Bob, WSO2, works for)
>>>>>>>
>>>>>>> *4. findNameEntityTypeViaDictionary(sentence, dictionary, entityType)*
>>>>>>>
>>>>>>> *Description:*
>>>>>>>
>>>>>>> This operation takes a sentence, a dictionary file and a predefined
>>>>>>> entity type as its inputs. It will return noun(s) in the sentence of
>>>>>>> the defined entity type that also exist in the dictionary, as event(s).
>>>>>>>
>>>>>>> *inputs:*
>>>>>>>
>>>>>>> sentence   : sentence to be processed
>>>>>>> dictionary : dictionary of entities of the defined entity type
>>>>>>> entityType : predefined entity type
>>>>>>>   ORGANIZATION
>>>>>>>   NAME
>>>>>>>   LOCATION
>>>>>>>
>>>>>>> *output:*
>>>>>>>
>>>>>>> matching noun(s) as event(s)
>>>>>>>
>>>>>>> *example:*
>>>>>>>
>>>>>>> inputs:
>>>>>>>   sentence   : Bob works at WSO2
>>>>>>>   dictionary : (WSO2,ORACLE,IBM)
>>>>>>>   entityType : ORGANIZATION
>>>>>>>
>>>>>>> output: WSO2
>>>>>>>
>>>>>>> Each NLP operation defined here will be implemented as a transformer
>>>>>>> extension to Siddhi.
>>>>>>> --
>>>>>>>
>>>>>>> *Malithi Edirisinghe*
>>>>>>> Senior Software Engineer
>>>>>>> WSO2 Inc.
>>>>>>>
>>>>>>> Mobile : +94 (0) 718176807
>>>>>>>  [email protected]
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ============================
>>>>>> Director, Research, WSO2 Inc.
>>>>>> Visiting Faculty, University of Moratuwa
>>>>>> Member, Apache Software Foundation
>>>>>> Research Scientist, Lanka Software Foundation
>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>> Phone: 0772360902
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> *S. Suhothayan*
>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>> *WSO2 Inc.* http://wso2.com
>>>>> lean . enterprise . middleware
>>>>>
>>>>> cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/ |
>>>>> twitter: http://twitter.com/suhothayan |
>>>>> linked-in: http://lk.linkedin.com/in/suhothayan
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chanuka Dissanayake
>>>> *Software Engineer | **WSO2 Inc.*; http://wso2.com
>>>>
>>>> Mobile: +94 71 33 63 596
>>>> Email: [email protected]
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>




_______________________________________________
Architecture mailing list
[email protected]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
