Re: More about storing NLP-type stuff in the index

Michael Sokolov Thu, 03 Jan 2013 17:59:04 -0800

On 1/3/2013 6:16 PM, Wu, Stephen T., Ph.D. wrote:

I think we've been saying that if we put something in a Payload, it will be
indexed.  From what I understand of the indexing format, that means that
what you put in the Payload will be stored in the Lucene index... But it
won't *itself* be indexed & optimized for search.


That's good, but can we build inverted indices on the contents of the
Payloads (or the Attributes) as well?
  Ex1: Say I put semantic role labels like ARG0 into my index. Say my search
is looking for all instances of ARG0.
  Ex2: Say I add payloads to terms indicating that they're named entities
belonging to a semantic group.  Then say my query looks for all instances of
the "Medications" semantic group.

It's almost like just putting these things in different fields, with the
exception that the things in different fields need to be linked so you know
what the original text was.  Maybe the linking can be done via Payloads
(offsets in the original text)?  If I want to store multiple things at the
same startOffset then I just use something like SynonymFilter?

I've been working on a different but (in a way) related problem:
indexing text in XML documents. In that case, we want to associate the
names of enclosing elements with each term so that it's possible to
search for (say) "ermine" in the context /doc/title as distinct from
"ermine" in the context of //paragraph, or something like that. Anyway
what I've done doesn't use payloads. I index two fields that are
relevant to this: a full text field, which is just the usual text index
(per document), and then an element-text field which indexes each term
as a concatenation of the element name and the term value, so:
title:ermine, doc:ermine, and paragraph:ermine would be typical terms.
I index all of the enclosing element names for each word at the same
position (like synonym filter does). This relies on a magical character
(":") that isn't allowed to appear in any tokens, which is too bad, but
not terribly restrictive.
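To make the token layout concrete, here is a rough sketch in plain Java.
It is not the Lucene TokenStream API and not my actual code; the class
name ElementTermSketch, the Token holder and the expand() method are all
made up for illustration. It only shows which terms come out and which
position increment each one would carry: the first element-prefixed term
for a word advances the position by 1, and the others are stacked on the
same position with an increment of 0.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java, not the Lucene analysis API.
final class ElementTermSketch {

    /** Reserved separator; it must never occur inside a word. */
    static final char SEPARATOR = ':';

    /** One emitted token: the term text plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * For each word, emit one "element:word" term per enclosing element.
     * All terms for the same word share one position, the way stacked
     * synonyms do.
     */
    static List<Token> expand(List<String> enclosingElements, List<String> words) {
        List<Token> out = new ArrayList<>();
        for (String word : words) {
            if (word.indexOf(SEPARATOR) >= 0) {
                throw new IllegalArgumentException("separator found in word: " + word);
            }
            boolean first = true;
            for (String element : enclosingElements) {
                // First term for this word moves to the next position;
                // the remaining terms stay on that same position.
                out.add(new Token(element + SEPARATOR + word, first ? 1 : 0));
                first = false;
            }
        }
        return out;
    }
}
```

For the words "white ermine" inside /doc/title this yields doc:white
(+1), title:white (+0), doc:ermine (+1), title:ermine (+0). In a real
analyzer the same idea would presumably live in a token filter that sets
the position increment on each emitted token.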

Something like this might work for you. The prefixing also has the nice
feature that when you enumerate terms, they are ordered first by prefix:
of course you could flip the order if it were more interesting to list
all "contexts" for a word rather than all words in a context (or with
some POS tag).
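The ordering point can be sketched with an ordinary sorted set standing
in for the term dictionary. Again this is plain Java rather than Lucene's
term enumeration API, and PrefixEnumerationSketch, withPrefix() and
flip() are names invented for the example. It assumes ASCII terms, where
Java's String ordering and a byte-wise term ordering agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Illustrative only: a SortedSet stands in for the sorted term dictionary.
final class PrefixEnumerationSketch {

    /** Collect the contiguous run of terms that start with the given prefix. */
    static List<String> withPrefix(SortedSet<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        // Seek to the first term >= prefix, then stop at the first mismatch.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            out.add(term);
        }
        return out;
    }

    /** Turn "element:word" into "word:element" so terms group by word instead. */
    static String flip(String term) {
        int i = term.indexOf(':');
        if (i < 0) {
            throw new IllegalArgumentException("no separator in term: " + term);
        }
        return term.substring(i + 1) + ':' + term.substring(0, i);
    }
}
```

With terms doc:ermine, paragraph:ermine, paragraph:stoat and
title:ermine, withPrefix(terms, "paragraph:") lists every word seen in a
paragraph. Flipping each term first and then asking for "ermine:" lists
every context in which "ermine" occurs.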


-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
