Greg Shackles wrote:
Thanks!  This all actually sounds promising; I just want to make sure I'm
thinking about this correctly.  Does this make sense?

Indexing process:

1) Get list of all words for a page and their attributes, stored in some sort of data structure
2) Concatenate the text from those words (space separated) into a string that represents the entire page
3) When adding the page document to the index, run it through a custom analyzer that attaches the payloads to the tokens
  * this would have to follow along in the word list from #1 to get the payload information for each token
  * would also have to tokenize the word we are storing to see how many Lucene tokens it would translate to (to make sure the right payloads go with the right tokens)
Right, sounds like you have it spot on. That second bullet under #3 looks like a possibly tricky part.
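
To make that concrete, here is a minimal sketch (not code from the thread) of the kind of payload-attaching TokenFilter step 3 describes, written against the Lucene 2.4-era Token/Payload API. The WordInfo type and its tokenCount()/encodeAttributes() helpers are hypothetical stand-ins for the word list built in step 1.

import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

/** Hypothetical holder for one page word and its attributes (built in step 1). */
interface WordInfo {
  int tokenCount();           // how many Lucene tokens this word's text produces
  byte[] encodeAttributes();  // the word's attributes packed into payload bytes
}

/** Attaches each page word's attributes as a payload on the tokens it produces. */
class WordAttributePayloadFilter extends TokenFilter {
  private final Iterator<WordInfo> words;  // word list from step 1, in page order
  private WordInfo current;                // word covering the current token(s)
  private int remainingTokens = 0;         // tokens left that belong to `current`

  WordAttributePayloadFilter(TokenStream input, Iterator<WordInfo> words) {
    super(input);
    this.words = words;
  }

  public Token next(Token reusableToken) throws IOException {
    Token token = input.next(reusableToken);
    if (token == null) {
      return null;
    }
    if (remainingTokens == 0) {
      // Advance to the next page word. The "tricky part": tokenCount() has to be
      // computed by running the same tokenization over that single word, so the
      // right number of tokens get this word's payload.
      current = words.next();
      remainingTokens = current.tokenCount();
    }
    token.setPayload(new Payload(current.encodeAttributes()));
    remainingTokens--;
    return token;
  }
}

An analyzer ending in a filter like this would be used when adding the page document, so each indexed token carries the attributes of the page word it came from.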
I haven't totally analyzed the searching process yet since I want to get my
head around the storage part first, but I imagine that would be the easier
part anyway.  Does this approach sound reasonable?
Sounds good.
My other concern is your comment about isolating results.  If I'm reading it
correctly, it means that I'd have to do the search in multiple passes, one
to get the individual docs containing the matches, and then one query for
each of those to get the payloads within them?
Right...you'd do it essentially how highlighting works: you do the search to get the docs of interest, and then redo the search somewhat to get the highlights/payloads for an individual doc at a time. You are redoing some work, but if you think about it, getting that info for every match (there could be tons) doesn't make much sense when someone might just look at the top couple of results, or say 10 at a time. It depends on your use case whether it's feasible or not, though. Most find it efficient enough to do highlighting with, so I'm assuming it should be good enough here.
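
As a rough illustration of that two-pass flow (a sketch under assumptions, not code from the thread): the page text is assumed to be stored in a "text" field and re-analyzable with the same payload-attaching analyzer for that page, and PayloadSpanUtil's package and exact signature may differ between Lucene versions.

import java.util.Collection;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.payloads.PayloadSpanUtil; // package may differ by version

class TwoPassPayloadSearch {
  void payloadsForTopHits(IndexSearcher searcher, Query query, Analyzer payloadAnalyzer)
      throws Exception {
    // Pass 1: the normal search, just to find which pages match.
    TopDocs hits = searcher.search(query, null, 10);

    for (ScoreDoc sd : hits.scoreDocs) {
      Document page = searcher.doc(sd.doc);

      // Pass 2: redo the query against a one-doc MemoryIndex holding only this
      // page (the same trick the span highlighter uses), so the payloads that
      // come back belong to this page alone. The analyzer passed in here must
      // be able to reproduce this page's payloads (e.g. seeded with its word list).
      MemoryIndex oneDoc = new MemoryIndex();
      oneDoc.addField("text", page.get("text"), payloadAnalyzer);
      IndexReader oneDocReader = oneDoc.createSearcher().getIndexReader();

      Collection payloads = new PayloadSpanUtil(oneDocReader).getPayloadsForQuery(query);
      // Each element is a byte[]; decode it back into the matched word's attributes.
    }
  }
}

Since the second pass only runs for the handful of hits actually being displayed, the repeated work stays proportional to the number of results shown rather than to the total number of matches.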
Thanks again for your help on this one.

- Greg


On Wed, Nov 12, 2008 at 12:52 PM, Mark Miller <[EMAIL PROTECTED]> wrote:

Here is a great PowerPoint on payloads from Michael Busch:
www.us.apachecon.com/us2007/downloads/AdvancedIndexing*Lucene*.ppt.
Essentially, you can store metadata at each term position, so it's an
excellent place to store attributes of the term - they are very fast to
load, efficient, etc.
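
As a purely illustrative example of what "attributes at a term position" might look like as payload bytes (the fields and widths below are assumptions, not something from the slides):

import java.nio.ByteBuffer;

// Illustrative only: pack a word's on-page attributes into a fixed-width payload.
final class WordPayloadCodec {
  static byte[] encode(short x, short y, byte fontSize, boolean bold, boolean italic) {
    byte flags = 0;
    if (bold)   flags |= 1;
    if (italic) flags |= 2;
    return ByteBuffer.allocate(6)
        .putShort(x)        // x position on the page
        .putShort(y)        // y position on the page
        .put(fontSize)
        .put(flags)
        .array();
  }

  static String decode(byte[] payload) {
    ByteBuffer buf = ByteBuffer.wrap(payload);
    return "x=" + buf.getShort() + " y=" + buf.getShort()
        + " size=" + buf.get() + " flags=" + buf.get();
  }
}

Variable-length data (such as the word exactly as it appeared on the page) could be appended after the fixed fields, since a payload is just an arbitrary byte[] stored at a position.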

You can check out the spans test classes for a small example using the
PayloadSpanUtil...it's actually fairly simple and short, and the main reason
I consider it experimental is that it hasn't really been used too much to my
knowledge (who knows though). If you have a problem, you'll know quickly and
I'll fix it quickly. It should work fine though. Overall, the approach wouldn't
take that much code, so I don't think you'd be out a lot of time.

The PayloadSpanUtil takes an IndexReader and a query and returns the
payloads for the terms in the IndexReader that match the query. If you end
up with multiple docs in the IndexReader, be sure to isolate the query down
to the exact doc you want the payloads from (the Span scoring mode of the
highlighter actually puts the doc in a fast MemoryIndex which only holds one
doc, and uses an IndexReader from the MemoryIndex).


Greg Shackles wrote:

Hey Mark,

This sounds very interesting.  Is there any documentation or examples I could see?  I did a quick search but didn't really find much.  It might just be that I don't know how payloads work in Lucene, but I'm not sure how I would see this actually doing what I need.  My reasoning is this...you'd have an index that stores all the text for a particular page.  Would you be able to attach payload information to individual words on that page?  In my head it seems like that would be the job of a second index, which is exactly why I added the word index.

Any details you can give would be great as I need to keep moving on this project quickly.  I will also say that I'm somewhat wary of using an experimental class since this is a really important project that really won't be able to wait on a lot of development cycles to get the class fully working.  That said, if it can give me serious speed improvements it's definitely worth considering.

- Greg


On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <[EMAIL PROTECTED]>
wrote:



If you're new to Lucene, this might be a little much (and maybe I'm not fully understanding the problem), but you might try:

Add the attributes to the words in a payload with a PayloadAnalyzer. Do searching as normal. Use the new PayloadSpanUtil class to get the payloads for the matching words. (Think of the PayloadSpanUtil as a highlighter: you give it a query, it gives you the payloads of the terms that match.) The PayloadSpanUtil class is a bit experimental, but I'll fix anything you run into with it.
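
A hedged sketch of what such a PayloadAnalyzer could look like, assuming whitespace tokenization and the hypothetical WordAttributePayloadFilter/WordInfo from the sketch near the top of this thread; this is not an analyzer Lucene ships:

import java.io.Reader;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Hypothetical per-page analyzer: a plain tokenizer wrapped by the payload-attaching
// filter, seeded with that page's word/attribute list from step 1.
class PayloadAnalyzer extends Analyzer {
  private final List<WordInfo> pageWords;

  PayloadAnalyzer(List<WordInfo> pageWords) {
    this.pageWords = pageWords;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new WordAttributePayloadFilter(
        new WhitespaceTokenizer(reader), pageWords.iterator());
  }
}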

- Mark


Greg Shackles wrote:



Hi Erick,

Thanks for the response; sorry that I was somewhat vague about the reasoning for my implementation in the first post.  I should have mentioned that the word details are not details of the Lucene document, but are attributes of the word that I am storing.  Some examples are position on the actual page, color, size, bold/italic/underlined, and most importantly, the text as it appeared on the page.  The reason the last one matters is that things like punctuation, spacing and capitalization can vary between the result and the search term, and can affect how I need to process the results afterwards.  I am certainly open to the idea of a new approach if it would improve on things; I admit I am new to Lucene, so if there are options I'm unaware of I'd love to learn about them.

Just to sum it up with an example, let's say we have a page of text that
stores "This is a page of text."  We want to search for the text "of
text",
which would span multiple words in the word index.  The final result
would
need to contain "of" and "text", along with the details about each as
described before.  I hope this is more helpful!
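
For that example, a span query keeps "of" and "text" adjacent and in order, and the positions of its matches are exactly where the per-word payloads would be read from (a sketch; the field name "text" is an assumption):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

class PhraseAsSpans {
  // "of" immediately followed by "text": slop 0, in order.
  static SpanQuery ofText() {
    return new SpanNearQuery(new SpanQuery[] {
        new SpanTermQuery(new Term("text", "of")),
        new SpanTermQuery(new Term("text", "text"))
    }, 0, true);
  }
}

Handing a query like this to PayloadSpanUtil, as suggested above in the thread, would return the payloads for both matching words, which carry the per-word details.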

- Greg

On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <[EMAIL PROTECTED]> wrote:



If I may suggest, could you expand upon what you're trying to accomplish? Why do you care about the detailed information about each word? The reason I'm suggesting this is "the XY problem". That is, people often ask for details about a specific approach when what they really need is a different approach.

There are TermFrequencies, TermPositions, TermVectorOffsetInfo and a bunch of other stuff that I don't know the details of that may work for you if we had a better idea of what it is you're trying to accomplish...

Best
Erick

On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <[EMAIL PROTECTED]>
wrote:

I hope this isn't a dumb question or anything; I'm fairly new to Lucene so I've been picking it up as I go, pretty much.  Without going into too much detail, I need to store pages of text, and for each word on each page, store detailed information about it.  To do this, I have 2 indexes:

1) pages: this stores the full text of the page, and identifying information about it
2) words: this stores a single word, along with the page it was on, and is stored in the order the words appear on the page

When doing a search, not only do I need to return the page it was found on, but also the details of the matching words.  Since I couldn't think of a better way to do it, I first search the pages index and find any matching pages.  Then I iterate the words on those pages to find where the match occurred.  Obviously this is costly as far as execution time goes, but at least it only has to get done for matching pages rather than every page.  Searches still take way longer than I'd like though, and the bottleneck is almost entirely in the code that finds the matches on the page.

One simple optimization I can think of is to store the pages in smaller blocks so that the scope of the iteration is made smaller.  This is not really ideal, since I also need the ability to narrow down results based on other words that can/can't appear on the same page, which would mean storing 3 full copies of every word on every page (one in each of the 3 resulting indexes).

I know this isn't a Java performance forum so I'll try to keep this Lucene related, but has anyone done anything similar to this, or have any comments/ideas on how to improve it?  I'm in the process of trying to speed things up since I need to perform many searches, often over very large sets of pages.  Thanks!

- Greg




