Hi,

We also index linguistic data, but (someone correct me if I'm wrong) you
have to work with what the Lucene store offers.
You can store the following.
Usable on the search side:
  - a term (TermAttribute)
  - the position of the term (PositionIncrementAttribute)
  - an arbitrary payload (PayloadAttribute)
Usable once you have found results:
  - the TermVector (terms only, or with offsets and/or positions)
  - any data you stored in a field (arbitrary data)
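
For example, at index time you would declare such fields roughly like
this (an untested sketch against the Lucene 2.9 API; the field names and
variables are made up):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  Document doc = new Document();
  // analyzed field that keeps a term vector with positions and offsets
  doc.add(new Field("text", chunkedText,
          Field.Store.NO, Field.Index.ANALYZED,
          Field.TermVector.WITH_POSITIONS_OFFSETS));
  // arbitrary data stored as-is alongside the document
  doc.add(new Field("raw", myInternalRepresentation,
          Field.Store.YES, Field.Index.NO));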

Offsets (OffsetAttribute) are stored in the term vector, if you asked for
them when creating the field. You can't search data within the
TermPositionVector, but you can iterate over your results and ask the
reader to return the TermPositionVector for a specific document and
field.
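
On the search side that looks roughly like this (again an untested
sketch of the Lucene 2.9 API; "text" and the term string are just
examples, reader is your IndexReader and docId a hit's document number):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermPositionVector;
  import org.apache.lucene.index.TermVectorOffsetInfo;

  // the cast works because the field was indexed WITH_POSITIONS_OFFSETS
  TermPositionVector tpv =
      (TermPositionVector) reader.getTermFreqVector(docId, "text");
  int idx = tpv.indexOf("AdjectivePhrase");
  if (idx != -1) {
    // character offsets recorded by your OffsetAttribute at analysis time
    for (TermVectorOffsetInfo o : tpv.getOffsets(idx)) {
      System.out.println(o.getStartOffset() + "-" + o.getEndOffset());
    }
  }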

Lucene can't store arbitrary Attributes; they are only useful inside the
analysis pipeline. If you want to search for this information, you have
to serialize the data inside the term itself (e.g. append a character to
the end of the term to describe the part of speech) and put
position-specific information inside the payload (e.g. a relation id, a
paragraph id or whatever you want: it's a byte[]).
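
A rough sketch of that serialization idea as a TokenFilter (Lucene 2.9
analysis API; PosTagProvider and its methods are stand-ins for whatever
your linguistic pipeline gives you):

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.index.Payload;

  public final class PosAndPayloadFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final PosTagProvider tagger; // hypothetical helper

    public PosAndPayloadFilter(TokenStream input, PosTagProvider tagger) {
      super(input);
      this.tagger = tagger;
    }

    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) return false;
      String term = termAtt.term();
      // serialize the part of speech into the term itself
      termAtt.setTermBuffer(term + "|" + tagger.posOf(term));
      // and put position-specific info into the payload (a raw byte[])
      payloadAtt.setPayload(new Payload(tagger.paragraphIdBytes(term)));
      return true;
    }
  }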

With those techniques you can do many things; you have to be inventive,
but with payloads you can do very interesting things. You can also store
the offsets inside the payload and not bother with term vectors at all,
as in the sketch below.
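
Reading them back at search time would look something like this
(untested; it assumes your filter wrote the start and end offsets as two
big-endian ints into an 8-byte payload, and reader is your IndexReader):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermPositions;

  TermPositions tp = reader.termPositions(new Term("text", "AdjectivePhrase"));
  byte[] buf = new byte[8];
  while (tp.next()) {
    for (int i = 0; i < tp.freq(); i++) {
      tp.nextPosition();
      if (tp.isPayloadAvailable()) {
        // assumes the payload is exactly 8 bytes: two big-endian ints
        byte[] p = tp.getPayload(buf, 0);
        int start = ((p[0] & 0xff) << 24) | ((p[1] & 0xff) << 16)
                  | ((p[2] & 0xff) << 8)  |  (p[3] & 0xff);
        int end   = ((p[4] & 0xff) << 24) | ((p[5] & 0xff) << 16)
                  | ((p[6] & 0xff) << 8)  |  (p[7] & 0xff);
        System.out.println("doc " + tp.doc() + " offsets " + start + "-" + end);
      }
    }
  }
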
Well, there are really hundreds of ways to deal with linguistic data
inside Lucene. What is hard is dealing with relations, but a triple
store is better suited for that.

I also suggest storing a serialized form of your internal
representation in the index; it may be more flexible to use than the
TermPositionVector.
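
For instance (untested sketch, Lucene 2.9; "nlp", MyAnalysis, serialize()
and deserialize() are placeholders for your own code):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  // index time: serialized internal representation as a binary stored field
  doc.add(new Field("nlp", serialize(myAnalysis), Field.Store.YES));

  // search time: get it back for each hit
  Document hit = searcher.doc(scoreDoc.doc);
  byte[] blob = hit.getBinaryValue("nlp");
  MyAnalysis analysis = deserialize(blob);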

Hope it helps.

On Fri, Oct 09, 2009 at 06:11:33PM +0200, Till Kolter wrote:
> I am quite new to Lucene, but I have searched the FAQs and consulted
> the mailinglist archive. I debugged through the source codes as well.
> 
> I have written an Analyzer that analyzes a stream by sending it to a
> whole pipeline of linguistic processing and uses the internal
> representation to construct a TokenStream that tokenizes chunks
> (semantic units). The TermAttribute string holds the abstract
> representations of those units. For further uses (for instance:
> highlighting the results in text), I need access to the
> OffsetAttribute that I defined in my TokenStream implementation. Like
> in StandardTokenizer I defined an OffsetAttribute to save the left and
> right values of the original chunks.
> 
> Now I want to search for all documents containing an
> "AdjectivePhrase", get those APs from the Documents and highlight all
> APs in the found documents.
> 
> I tried to find results by getting TermPositions with
> "Reader.termPositions(term)" and then iterate over the positions, but
> the positions only represent the left offset.
> 
> Is there another function to get structured results from term queries
> over documents, where I can get the whole set of attributes that I
> constructed in the TokenStream with addAttribute(Class)? I did not
> find such a function, but I guess I don't know all the retrieval
> methods of Lucene yet. For my search I used the IndexSearcher.
> 
> Thanks
> Till Kolter
> 

-- 
David Causse
Spotter
http://www.spotter.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
