Here is a great PowerPoint on payloads from Michael Busch:
www.us.apachecon.com/us2007/downloads/AdvancedIndexingLucene.ppt.
Essentially, you can store metadata at each term position, so it's an
excellent place to store attributes of the term - they are very fast to
load, efficient, etc.
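Writing them is just a matter of a custom TokenFilter that sets a payload on
each token as it goes by. Something like this, as a rough, untested sketch
against the 2.4-era Token/Payload API - the class name and the attribute
bytes are made up for illustration, not an existing Lucene class:

  import java.io.IOException;

  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  // Hypothetical filter: stamps every token with a few bytes of per-word
  // metadata (style flags, position on the page, whatever you need).
  public class WordAttributePayloadFilter extends TokenFilter {

    public WordAttributePayloadFilter(TokenStream input) {
      super(input);
    }

    public Token next() throws IOException {
      Token token = input.next();
      if (token == null) {
        return null;
      }
      // Encode whatever per-word attributes you have; a single flag byte here.
      byte[] attributes = new byte[] { (byte) 1 }; // e.g. 1 = bold
      token.setPayload(new Payload(attributes));
      return token;
    }
  }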
You can check out the spans test classes for a small example using the
PayloadSpanUtil... it's actually fairly simple and short, and the main
reason I consider it experimental is that it hasn't really been used too
much to my knowledge (who knows though). If you have a problem, you'll
know quickly and I'll fix it quickly. It should work fine though. Overall,
the approach wouldn't take that much code, so I don't think you'd be out
a lot of time.
The PayloadSpanUtil takes an IndexReader and a query and returns the
payloads for the terms in the IndexReader that match the query. If you
end up with multiple docs in the IndexReader, be sure to isolate the
query down to the exact doc you want the payloads from (the Span scoring
mode of the highlighter actually puts the doc in a fast MemoryIndex
which only holds one doc, and uses an IndexReader from the MemoryIndex).
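In code, the whole lookup is only a few lines. Roughly like this - untested,
so double-check the PayloadSpanUtil package and the method names against your
Lucene release; the "text" field name is just an example, and the analyzer
has to be the same payload-producing analyzer you index with:

  import java.util.Collection;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.memory.MemoryIndex;          // contrib/memory
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.payloads.PayloadSpanUtil;   // package may vary by release

  public class PayloadLookup {

    // Re-analyze the single matching page into a one-doc MemoryIndex, then ask
    // PayloadSpanUtil for the payloads of the terms the query matched there.
    public static Collection getMatchingPayloads(String pageText, Query query,
                                                 Analyzer payloadAnalyzer) throws Exception {
      MemoryIndex memoryIndex = new MemoryIndex();
      memoryIndex.addField("text", pageText, payloadAnalyzer);

      IndexSearcher searcher = memoryIndex.createSearcher();
      IndexReader reader = searcher.getIndexReader();

      PayloadSpanUtil util = new PayloadSpanUtil(reader);
      return util.getPayloadsForQuery(query);   // Collection of byte[] payloads
    }
  }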
Greg Shackles wrote:
Hey Mark,
This sounds very interesting. Is there any documentation or examples I
could see? I did a quick search but didn't really find much. It might just
be that I don't know how payloads work in Lucene, but I'm not sure how I
would see this actually doing what I need. My reasoning is this...you'd
have an index that stores all the text for a particular page. Would you be
able to attach payload information to individual words on that page? In my
head it seems like that would be the job of a second index, which is exactly
why I added the word index.
Any details you can give would be great as I need to keep moving on this
project quickly. I will also say that I'm somewhat wary of using an
experimental class since this is a really important project that really
won't be able to wait on a lot of development cycles to get the class fully
working. That said, if it can give me serious speed improvements it's
definitely worth considering.
- Greg
On Wed, Nov 12, 2008 at 12:01 PM, Mark Miller <[EMAIL PROTECTED]> wrote:
If you're new to Lucene, this might be a little much (and maybe I don't
fully understand the problem), but you might try:
Add the attributes to the words in a payload with a PayloadAnalyzer. Do
searching as normal. Use the new PayloadSpanUtil class to get the payloads
for the matching words. (Think of the PayloadSpanUtil as a highlighter - you
give it a query, it gives you the payloads for the terms that match.) The
PayloadSpanUtil class is a bit experimental, but I'll fix anything you run
into with it.
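For the PayloadAnalyzer part, any analyzer whose token stream ends in a
payload-setting filter will do. A rough sketch only - "PayloadAnalyzer" here
is just a descriptive name, and WordAttributePayloadFilter is the kind of
hypothetical filter sketched at the top of this thread, not an existing
Lucene class:

  import java.io.Reader;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.WhitespaceTokenizer;

  // Hypothetical analyzer: plain tokenization, then a filter that attaches
  // the per-word attribute bytes as payloads.
  public class PayloadAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream stream = new WhitespaceTokenizer(reader);
      stream = new LowerCaseFilter(stream);
      return new WordAttributePayloadFilter(stream);
    }
  }

Index your page text with that analyzer and the payloads ride along with
every term; the query side doesn't change at all.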
- Mark
Greg Shackles wrote:
Hi Erick,
Thanks for the response, sorry that I was somewhat vague in the reasoning
for my implementation in the first post. I should have mentioned that the
word details are not details of the Lucene document, but are attributes
about the word that I am storing. Some examples are position on the actual
page, color, size, bold/italic/underlined, and most importantly, the text as
it appeared on the page. The reason the last one matters is that things like
punctuation, spacing and capitalization can vary between the result and the
search term, and can affect how I need to process the results afterwards. I
am certainly open to the idea of a new approach if it would improve on
things; I admit I'm new to Lucene, so if there are options I'm unaware of
I'd love to learn about them.
Just to sum it up with an example, let's say we have a page of text that
stores "This is a page of text." We want to search for the text "of
text",
which would span multiple words in the word index. The final result would
need to contain "of" and "text", along with the details about each as
described before. I hope this is more helpful!
- Greg
On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <[EMAIL PROTECTED]> wrote:
If I may suggest, could you expand upon what you're trying to
accomplish? Why do you care about the detailed information
about each word? The reason I'm suggesting this is "the XY
problem". That is, people often ask for details about a specific
approach when what they really need is a different approach.
There are TermFrequencies, TermPositions,
TermVectorOffsetInfo and a bunch of other stuff that I don't
know the details of that may work for you if we had
a better idea of what it is you're trying to accomplish...
Best
Erick
On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <[EMAIL PROTECTED]> wrote:
I hope this isn't a dumb question or anything - I'm fairly new to Lucene so
I've been picking it up as I go pretty much. Without going into too much
detail, I need to store pages of text and, for each word on each page, store
detailed information about it. To do this, I have 2 indexes:
1) pages: this stores the full text of the page, and identifying information
about it
2) words: this stores a single word, along with the page it was on; words
are stored in the order they appear on the page
When doing a search, not only do I need to return the page it was found on,
but also the details of the matching words. Since I couldn't think of a
better way to do it, I first search the pages index and find any matching
pages. Then I iterate the words on those pages to find where the match
occurred. Obviously this is costly as far as execution time goes, but at
least it only has to get done for matching pages rather than every page.
Searches still take way longer than I'd like though, and the bottleneck is
almost entirely in the code that finds the matches on the page.
One simple optimization I can think of is to store the pages in smaller
blocks so that the scope of the iteration is made smaller. This is not
really ideal, since I also need the ability to narrow down results based on
other words that can/can't appear on the same page, which would mean storing
3 full copies of every word on every page (one in each of the 3 resulting
indexes).
I know this isn't a Java performance forum so I'll try to keep this Lucene
related, but has anyone done anything similar to this, or have any
comments/ideas on how to improve it? I'm in the process of trying to speed
things up since I need to perform many searches often over very large sets
of pages. Thanks!
- Greg