Hi Greg,

Thanks for the quick and detailed answer. What kind of queries do you run? Will it work for SpanNearQueries/SpanNotQueries as well? Do you also get the word itself at each position?
It would be great if I could search on the content of each payload as well, but since the payload content is quite complicated and not a simple value, I guess that's too much to ask for. What version of Lucene are you using? I'm not sure I'll be able to use the latest fixes.

Thanks again,
Eran.

On Wed, Nov 26, 2008 at 4:47 PM, Greg Shackles <[EMAIL PROTECTED]> wrote:

> Sure, I'm happy to give some insight into this. My index itself has a few
> fields: one that uniquely identifies the page, one that stores all the text
> on the page, and some others that store characteristics. At indexing time,
> the text field for each document is created manually by concatenating the
> words together, separated by spaces. Then the IndexWriter runs the document
> through a custom filter that attaches payloads to each token. The payloads
> include all the attributes I need regarding that word and, most
> importantly, the index of that word on the page. The tricky part was that
> one of my "words" could map to more than one Lucene token, so I first
> create a quick map from my words to the tokens they correspond to by
> running each word through an Analyzer (StandardAnalyzer in my case). This
> makes it easy to attach the payload only to the first token for each of my
> words.
>
> For searching, I pass the search query to a PayloadSpanUtil, which gets the
> payloads for every match throughout the entire index. I take these results
> and put them into a Collection of custom objects, then sort them first by
> page identifier and then by index on the page. Once I have this list, I can
> quickly iterate through it to find the groupings of payloads that match the
> search term (this also helps weed out the occasional bad result that comes
> back). I wasn't sure initially whether this would be a performance hit, but
> it is very quick. Basically, I tokenize the search string, then concatenate
> all the tokens together without spaces into one string. Then, when
> iterating through the list, I check whether the word matches the start of
> the tokenized string; if so, I chop it off and keep going until the whole
> string is found. Then repeat, and so on. It's certainly not the most
> elegant solution, but I didn't see a better way since PSU doesn't group or
> sort on its own.
>
> One other solution I might try if I have time is to take each document from
> the original search, put them one at a time into a MemoryIndex, and then let
> PSU act on that. I'm not sure whether this would help or hurt performance,
> but it might be worth trying. I will also say to make sure you apply Mark's
> latest patch (see the case here:
> https://issues.apache.org/jira/browse/LUCENE-1465), since it fixed some
> important bugs I had come across.
>
> I hope this made sense. I haven't finished my morning coffee yet, so I
> can't be too sure : ) Let me know if you have any more questions.
>
> - Greg
>
> On Wed, Nov 26, 2008 at 3:19 AM, Eran Sevi <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> > Can you please shed some light on what your final architecture looks like?
> > Do you manually use the PayloadSpanUtil for each document separately?
> > How did you solve the problem with phrase results?
> > Thanks in advance for your time,
> > Eran.
> >
> > On Tue, Nov 25, 2008 at 10:30 PM, Greg Shackles <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Just wanted to post a little follow-up here now that I've gotten through
> > > implementing the system using payloads. Execution times are phenomenal!
> > > Things that took over a minute to run in my old system take fractions of
> > > a second to run now. I would also like to thank Mark for being very
> > > responsive in fixing/patching some bugs I encountered along the way.
> > >
> > > - Greg
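
For anyone following the thread, the post-processing step Greg describes (sort the payload hits by page identifier and then by word index, then consume a concatenated token string from the front) could be sketched roughly as below. This is only an illustration in plain Python, not Greg's code and not Lucene API: the `Hit` record, its field names, and `group_phrase_matches` are hypothetical, and the sketch assumes the words stored in the payloads were normalized the same way as the query tokens. Resetting a partial match when the page changes is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """Hypothetical record decoded from one payload returned for a match."""
    page_id: str   # value of the field that uniquely identifies the page
    position: int  # index of the word on the page (stored in the payload)
    word: str      # the word itself, normalized like the query tokens


def group_phrase_matches(hits: List[Hit], query_tokens: List[str]) -> List[List[Hit]]:
    """Group sorted hits into runs that spell out the concatenated query."""
    target = "".join(query_tokens)  # tokens joined without spaces
    if not target:
        return []

    # Sort first by page identifier, then by index on the page.
    ordered = sorted(hits, key=lambda h: (h.page_id, h.position))

    groups: List[List[Hit]] = []
    current: List[Hit] = []
    remaining = target

    for hit in ordered:
        # Assumption: a phrase never spans two pages, so drop a partial match.
        if current and hit.page_id != current[-1].page_id:
            current, remaining = [], target

        if not (hit.word and remaining.startswith(hit.word)):
            # Not a continuation: discard the partial match and retry this
            # hit as the possible start of a new match.
            current, remaining = [], target
            if not (hit.word and remaining.startswith(hit.word)):
                continue  # stray hit, weeded out

        # The word matches the start of what is left: chop it off.
        current.append(hit)
        remaining = remaining[len(hit.word):]

        if not remaining:  # the whole string has been found
            groups.append(current)
            current, remaining = [], target

    return groups
```

Called with the hits for a query such as "quick fox" and `query_tokens=["quick", "fox"]`, this returns one list of hits per phrase occurrence. Note that the sketch does not check that the word indexes within a group are consecutive; if that matters, a comparison of `position` values would need to be added.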