Re: [MarkLogic Dev General] Identifying entities in search phrases

seme...@hotmail.com Wed, 13 Jun 2012 11:52:09 -0700

We already have ways to match "Tim Cook" on "Timothy D. Cook". The hard part is 
knowing that "Tim Cook" is there and picking it out of a phrase like "Today Tim 
Cook visited China". We can generate all the sub-phrases ("Today Tim", "Today 
Tim Cook", "Tim Cook", "Tim Cook visited", etc.), query them against tuples, and 
find good matches, but performance becomes a concern once the search phrase 
passes a certain number of words, because the number of sub-phrases gets so large.
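
A minimal sketch of that sub-phrase approach, in Python rather than XQuery and 
only for illustration. It assumes the candidates are runs of adjacent words, 
and it stands in for the database lookup with a hypothetical in-memory alias 
table (ALIASES); the names contiguous_phrases and find_entities are made up 
for this example and are not part of any MarkLogic API.

```python
def contiguous_phrases(text, max_words=3):
    """Yield (start, end, phrase) for every run of 1..max_words adjacent words."""
    words = text.split()
    for start in range(len(words)):
        # end is exclusive; never run past the last word
        last_end = min(start + max_words, len(words))
        for end in range(start + 1, last_end + 1):
            yield start, end, " ".join(words[start:end])


# Hypothetical alias index: lower-cased alias -> canonical name stored in the DB.
# In a real system this lookup would be a query against the name data.
ALIASES = {
    "tim cook": "Timothy D. Cook",
    "timothy cook": "Timothy D. Cook",
}


def find_entities(text, aliases=ALIASES, max_words=3):
    """Return (matched phrase, canonical name) pairs found in the search text."""
    hits = []
    for _start, _end, phrase in contiguous_phrases(text, max_words):
        canonical = aliases.get(phrase.lower())
        if canonical is not None:
            hits.append((phrase, canonical))
    return hits


if __name__ == "__main__":
    print(find_entities("Today Tim Cook visited China"))
```

If only adjacent words are considered, a phrase of n words has n(n+1)/2 
candidates, and capping candidates at k words keeps the count at or below n*k. 
Whether that is cheap enough depends on how expensive each lookup is.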


It appears that cts:entity-highlight requires a license for entity enrichment, 
which isn't really in our plan. In essence we're trying to mimic what third-party 
entity enrichment would be doing, but using our semantic data as the source for 
the entities to identify.

> From: m...@blakeley.com
> Date: Wed, 13 Jun 2012 11:38:06 -0700
> To: general@developer.marklogic.com
> Subject: Re: [MarkLogic Dev General] Identifying entities in search phrases
> 
> Seems to me that you will need a way to canonicalize the entity in the 
> database, and in the user query. I think cts:entity-highlight exposes 
> something like that as $cts:normalized-text - but you might have to extend 
> the enrichment library to add its value as an attribute or something. You 
> might be able to call cts:entity-highlight on the query, too, but with such a 
> short text string the enrichment code might struggle to find anything.
> 
> This could also be a good fit for the semantic library: turn the identified 
> entities into triples, which is really just a way of canonicalizing them, and 
> store those triples somewhere in the document. Then the user query still has 
> to be expressed in a way that will match that triple, or processed into such 
> a form.
> 
> -- Mike
> 
> On 13 Jun 2012, at 08:46 , seme...@hotmail.com wrote:
> 
> > Does anyone have an elegant (and high performance) way to identify entities 
> > in search phrases based on data you have in the DB? For example, if you had 
> > data on people in the database and you wanted to match the words with 
> > documents in the database.
> > 
> > user types in: "Tim Cook visits China" 
> > and in the DB you have a doc with an element or two of names like "Timothy 
> > D. Cook" and you want to identify "Tim Cook" as being "Timothy D. Cook" in 
> > your DB.
> > 
> > It's possible to create several permutations of the words in the search 
> > phrase and find highly relevant hits, but as the number of words increases, 
> > the time to do the search increases even faster.
> > 
> > Note too that it doesn't work very well by taking the entire search phrase 
> > and searching the DB because in the above example "visits China" isn't 
> > relevant to the name info of "Timothy D. Cook" so relevancy may not be so 
> > good and you may get a lot of false positives.
> > 
> > Also note that I am interested not so much in knowing that "Tim Cook" is a 
> > person but that it matches the document in my DB for "Timothy D. Cook". I 
> > want to identify entities in the search phrase based on the data in my DB, 
> > not based on rules of grammar or proper names etc.
> > 
> > Other options that look promising are using cts:classify or using regular 
> > expressions instead of DB queries.
> > 
> > Does anyone have a good idea on how to do this?
> > 
> > Thanks,
> > Ryan
> > _______________________________________________
> > General mailing list
> > General@developer.marklogic.com
> > http://community.marklogic.com/mailman/listinfo/general
> 