Hi,

We also index linguistic data, but (someone correct me if I'm wrong) you have to work with what the Lucene store offers.

Usable on the search side, you can store:
- a term (TermAttribute)
- the position of the term (PositionIncrementAttribute)
- an arbitrary payload (PayloadAttribute)

Usable once you have found results:
- a term vector (no attribute, or OffsetAttribute and/or PositionIncrementAttribute)
- any data you stored in a field (arbitrary data)
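For example, a minimal sketch of the analysis side could look like the filter below. It is untested and written against the 2.9 attribute API; PosPayloadFilter and lookupPos() are made-up names standing in for a hook into your own linguistic pipeline:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

/** Attaches a part-of-speech tag to each token as a payload. */
public final class PosPayloadFilter extends TokenFilter {

  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public PosPayloadFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // The payload is stored per position in the postings and can be
    // read back at search time.
    byte[] tag = lookupPos(termAtt.term());
    payloadAtt.setPayload(new Payload(tag));
    return true;
  }

  // Made-up hook: ask your linguistic pipeline for the tag of this term.
  private byte[] lookupPos(String term) {
    return "AP".getBytes(); // e.g. "AP" for AdjectivePhrase
  }
}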
The OffsetAttribute is stored in the term vector (if you asked for it when indexing the field). You can't search data within the TermPositionVector, but you can iterate over your results and ask the reader to return the TermPositionVector for a specific document and field (see the sketch in the PS below).

Lucene can't store arbitrary Attributes; they are only useful inside the analysis pipeline. If you want to search on that information, you have to serialize it yourself: either inside the term itself (e.g. append a character to the end of the term to describe the part of speech), or inside the payload for position-specific information (e.g. a relation id, a paragraph id or whatever you want: it's a byte[]).

With those techniques you can do many things. You have to be inventive, but with payloads you can do very interesting things. You can even store the offsets inside the payload and not bother with term vectors at all! There are really hundreds of ways to deal with linguistic data inside Lucene. What is hard is dealing with relations, but a triple store is better suited for that. I also suggest storing a serialized form of your internal representation in a stored field of the index; it may be more flexible to use than the TermPositionVector.

Hope it helps.

On Fri, Oct 09, 2009 at 06:11:33PM +0200, Till Kolter wrote:
> I am quite new to Lucene, but I have searched the FAQs and consulted
> the mailing list archive. I debugged through the source code as well.
>
> I have written an Analyzer that sends a stream through a whole
> pipeline of linguistic processing and uses the internal representation
> to construct a TokenStream that tokenizes chunks (semantic units). The
> TermAttribute string holds the abstract representations of those
> units. For further uses (for instance, highlighting the results in the
> text), I need access to the OffsetAttribute that I defined in my
> TokenStream implementation. As in StandardTokenizer, I defined an
> OffsetAttribute to save the left and right offsets of the original
> chunks.
>
> Now I want to search for all documents containing an
> "AdjectivePhrase", get those APs from the documents and highlight all
> APs in the found documents.
>
> I tried to find results by getting TermPositions with
> "Reader.termPositions(term)" and then iterating over the positions,
> but the positions only represent the left offset.
>
> Is there another function to get structured results from term queries
> over documents, where I can get the whole set of attributes that I
> constructed in the TokenStream with addAttribute(Class)? I did not
> find such a function, but I guess I don't know all the retrieval
> methods of Lucene yet. For my search I used the IndexSearcher.
>
> Thanks
> Till Kolter

--
David Causse
Spotter
http://www.spotter.com/
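PS: and a rough sketch of the retrieval side mentioned above, again untested and against the 2.9 API. It assumes the field (called "chunks" here, just an example name) was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS; the "AdjectivePhrase" term is also only an example:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class ApOffsetLookup {

  public static void printApOffsets(IndexSearcher searcher) throws Exception {
    Term ap = new Term("chunks", "AdjectivePhrase");
    TopDocs hits = searcher.search(new TermQuery(ap), 10);
    IndexReader reader = searcher.getIndexReader();

    for (ScoreDoc hit : hits.scoreDocs) {
      // Ask the reader for the term vector of this document and field;
      // the cast only works if positions/offsets were stored with it.
      TermPositionVector tpv =
          (TermPositionVector) reader.getTermFreqVector(hit.doc, "chunks");
      if (tpv == null) {
        continue;
      }
      int idx = tpv.indexOf("AdjectivePhrase");
      TermVectorOffsetInfo[] offsets = (idx < 0) ? null : tpv.getOffsets(idx);
      if (offsets == null) {
        continue;
      }
      for (TermVectorOffsetInfo off : offsets) {
        System.out.println("doc " + hit.doc + ": ["
            + off.getStartOffset() + ", " + off.getEndOffset() + "]");
      }
    }
  }
}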