Re: Sort by date THEN by relevancy

KEGan Sat, 30 Sep 2006 22:37:38 -0700

Erick,

Wow!! Thanks for the invaluable advice and for sharing your experience :) I
greatly appreciate it.


Alright, I think it really makes the most sense to follow the path of least
resistance first, then see if it really needs optimization.

Thanks a lot.

~KEGan


On 9/30/06, Erick Erickson <[EMAIL PROTECTED]> wrote:

See below

On 9/29/06, KEGan <[EMAIL PROTECTED]> wrote:
>
> Erick,
>
> Thanks for the great advice!!
>
> About closing/opening the searcher on each request .... isn't this
> unavoidable in some cases? The application I am building will have users
> insert/search documents all the time. So for every insert, the searcher
> needs to be recreated, doesn't it? Otherwise new documents won't be
> searchable with the 'old' searcher, right?

Yep, if your requirement is that the documents be immediately searchable
then the searcher will have to be opened/closed each time. Except you can
get as clever as you want with this.

Say, for example, you keep two indexes, one in RAM and one on disk. The
algorithm then goes something like this.

For a search:

Open FSDir-based index *searcher*
Open RAM-based index writer.
For each search {
  Open ram-based searcher
  Use a multi-searcher on the FSDir and RAM-based indexes
  Close the ram-based searcher;
  return results
}

When a user adds a document, just add it to both the RAM-based and FS-based
indexes. Searches will see it immediately via the code above. Whether you
open/close the FS-based writer for each document added is up to you.

Periodically close the FS-based readers and writers and re-create the
RAM-based index.

NOTE: If it were me, I'd actually write to a COPY of the FS-based index, and
when you synchronize (i.e. close/reopen the FS-based searcher), copy your
most recent FS-based index to your "real" directory before re-opening your
FS-based index.

Also, you'll have to do some good coordination so you don't accidentally get
the same document from both indexes.
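
One way to picture that coordination is the rough sketch below, in plain Java rather than Lucene calls. It assumes every document carries an application-level unique id; the MergedHit and HitMerger names are hypothetical and not part of any library. Hits from the RAM-based index are taken first, so a document found in both indexes is reported only once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a search hit; "id" is an assumed unique document key.
final class MergedHit {
    final String id;
    final float score;

    MergedHit(String id, float score) {
        this.id = id;
        this.score = score;
    }
}

final class HitMerger {
    // Combine hits from the RAM-based and FS-based indexes, keeping one hit per id.
    // RAM hits are inserted first, so the RAM copy wins when both indexes match.
    static List<MergedHit> merge(List<MergedHit> ramHits, List<MergedHit> diskHits) {
        Map<String, MergedHit> byId = new LinkedHashMap<>();
        for (MergedHit h : ramHits) {
            byId.putIfAbsent(h.id, h);
        }
        for (MergedHit h : diskHits) {
            byId.putIfAbsent(h.id, h);
        }
        return new ArrayList<>(byId.values());
    }

    public static void main(String[] args) {
        List<MergedHit> merged = merge(
            Arrays.asList(new MergedHit("doc-2", 0.5f)),
            Arrays.asList(new MergedHit("doc-1", 0.7f), new MergedHit("doc-2", 0.4f)));
        for (MergedHit h : merged) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

Note that this only removes duplicates. It keeps RAM hits ahead of disk hits and does not re-rank the combined list, so any sorting would still have to be applied afterwards.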

All that said, are you really, really, really sure that additions to the
index *must* be immediately available? Or would, say, a 10 minute (or even
1 hour) delay be acceptable? I'd recommend you check carefully, since I've
often seen cases where if you ask your product manager what "immediate"
means, and explain that they can have a product faster with fewer bugs if
they accept something like, say, an hour latency and you could spend the
time on some *other* feature, the PM will say "fine". Especially if you
explain that you can still do the immediate thing later.

In particular, I'd think about selling this as something that you'll change
later, and do it the simple way (in this case, just close/reopen the
searchers every 1/2 hour, say) in the interests of getting something into
the hands of the users sooner. To be sure, build the simple case with the
notion of immediate searches in mind, but don't spend the time actually
doing the complex thing first off. Then, one of three things will happen:
1> in actual real-world use, the delay is just fine and you can work on
other things that are more valuable, or 2> the delay isn't acceptable and
you have to implement the change, or 3> the delay isn't acceptable but never
gets high enough on the priority list to get fixed, in which case it's
really situation <1>. In case <2>, you haven't lost any time to speak of. In
<1 or 3>, you have saved time you can spend working on something *else* of
more importance.

This long diatribe is mostly based on the eXtreme Programming/Agile
methodologies model, and it's Saturday and I'm not at work <G>. I've spent
faaaar too much of my professional life working on unimportant features of a
project at the expense of things that are actually *useful* because I've
uncritically accepted requirements like "the documents must be immediately
searchable" that the product managers would gladly forgo if they knew they
could get a different feature if they would accept a 1/2 hour delay.

> For this situation, I am thinking of recreating the searcher (and doing a
> warm up) inside the thread that does the insert. With this, the
> performance penalty falls on the user that does the insert. Also for my
> application, there will be more searches than inserts.

Well, really, the general case here is that you want the warmup to happen
outside the "current" searcher. Again, before you get fancy, just see if the
delay for the searcher when re-opening it is acceptable. I have no idea what
your actual delay for the first search after opening your index is. Don't go
the coordination route until you *know* it's unacceptable.

But if the delay is unacceptable, there are several alternatives. In
general, you could easily have a thread that is your "warmer-upper" that
could even be your document add code. But, I suspect this is more complex
than you think. What happens if two users add documents at the same time?
How do you get the "right" warmed-up searcher? Will there be collisions? How
do you debug this kind of thing? If you *must* do this, I'd make sure all
the code for warming things up is in the searcher. Upon some signal, it
fires up a thread that opens a searcher and warms it up. Upon thread
termination, swap the actual searcher you're using for requests with the
just-warmed-up one. But *please* make sure you need to first.
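
For what it's worth, the warm-then-swap idea might look roughly like the sketch below. It is plain Java, not Lucene code: the SearcherHolder class is a hypothetical name, S stands for whatever searcher type the application uses, and the opener and warmer callbacks are assumptions supplied by the caller. A lock serializes refreshes so two simultaneous adds cannot race each other, and requests keep using the old searcher until the new one is ready.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical holder; S is the application's searcher type.
final class SearcherHolder<S> {
    private final AtomicReference<S> current;
    private final Supplier<S> opener;  // assumed callback: opens a fresh searcher
    private final Consumer<S> warmer;  // assumed callback: runs warm-up queries
    private final Object refreshLock = new Object();

    SearcherHolder(Supplier<S> opener, Consumer<S> warmer) {
        this.opener = opener;
        this.warmer = warmer;
        this.current = new AtomicReference<>(opener.get());
    }

    // Requests always use whichever searcher is current.
    S get() {
        return current.get();
    }

    // Open and warm a new searcher in the background, then swap it in.
    // Only one refresh runs at a time, so concurrent signals do not collide.
    CompletableFuture<Void> refreshAsync() {
        return CompletableFuture.runAsync(() -> {
            synchronized (refreshLock) {
                S fresh = opener.get();
                warmer.accept(fresh);
                current.set(fresh);
            }
        });
    }

    public static void main(String[] args) {
        AtomicInteger generation = new AtomicInteger();
        SearcherHolder<String> holder = new SearcherHolder<>(
            () -> "searcher-" + generation.incrementAndGet(),
            s -> System.out.println("warming " + s));
        System.out.println("serving with " + holder.get());
        holder.refreshAsync().join();
        System.out.println("serving with " + holder.get());
    }
}
```

The sketch leaves out closing the old searcher, which would have to wait until in-flight searches against it have finished.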

Good Lord, this has gotten long!

> Is this what people normally do?

I have no clue. Each situation is different <G>.

> Thanks.
>
> ~KEGan
>
>
> On 9/30/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> >
> > Sorting will inevitably have an impact on your speed, but it's
> > impossible to generalize. FWIW, my app has 870K documents, the index is
> > around 1.4G and search/sort times are fine. But even that statement is
> > misleading. "Fine" means that the product manager for this product is
> > satisfied with performance, which has no relevance to your situation
> > <G>......
> >
> > I'm afraid that you'll just have to put in your sorting and see. I know
> > that's not a very satisfactory answer, but without knowing lots of
> > details about your app AND the distribution of terms AND the expected
> > throughput AND the usage statistics AND......., it's hard to say.
> >
> > It was easy to put together a test harness that fired off a bunch of
> > threads at my searcher and measured throughput. I highly recommend
> > something like this if you're going to try to answer this question
> > before putting the product in production, just so you get an idea of
> > what to expect.
> >
> > Be sure you are satisfied with the performance before adding sorting.
> > Lots of people have gotten into trouble by opening/closing searchers for
> > each request, which is FAR more expensive than sorting in my experience.
> > It would be unfortunate to think your problem was sorting when, in fact,
> > it was something else.
> >
> > Best
> > Erick
> >
> > On 9/29/06, KEGan <[EMAIL PROTECTED]> wrote:
> > >
> > > Erick,
> > >
> > > Ouch!! Please excuse the cut-n-paste ;)
> > >
> > > LIA mentions a lot about performance when doing sorting. Is it
> > > something to be cautious about? You mention doing 5 fields and it
> > > works ok, ... can you share with us how many documents you are
> > > handling there with 5 fields?
> > >
> > > Thanks.
> > >
> > > ~KEGan
> > >
> > > On 9/29/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Yes. I do this with 5 fields and it works just fine. Although your
> > > > cut-n-paste got kind of hard to read <G>....
> > > >
> > > > Erick
> > > >
> > > > On 9/29/06, KEGan <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > I think I am going to answer my own question.
> > > > >
> > > > > > Just use the
> > > > > >
> > > > > > Sort(SortField[] fields)
> > > > > > Sort(String[] fields)
> > > > > >
> > > > > > constructors of org.apache.lucene.search.Sort (lucene-2.0.0).
> > > > >
> > > > > This should do it right ?
> > > > >
> > > > >
> > > > >
> > > > > On 9/29/06, KEGan <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > I have seen some sort examples in LIA, but can't find what I
> > > > > > > am looking for. How do I sort documents by date, AND for all
> > > > > > > the documents with the same date ... have these sorted by
> > > > > > > relevancy? (Date has higher sort priority in this case.)
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> >
> >
>
>
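
To make the ordering asked about at the bottom of the thread concrete, here is a small sketch in plain Java. It is not Lucene API code; the DatedHit and DateThenScore names are hypothetical, and it assumes dates are stored as yyyyMMdd strings so that string order matches chronological order. Date is the primary key, and the relevancy score only breaks ties between hits with the same date.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical hit with an indexed date and a relevancy score.
final class DatedHit {
    final String date;  // assumed yyyyMMdd, e.g. "20060930"
    final float score;

    DatedHit(String date, float score) {
        this.date = date;
        this.score = score;
    }
}

final class DateThenScore {
    // Primary key: date ascending. Tie-breaker: score descending (best match first).
    static final Comparator<DatedHit> ORDER = (a, b) -> {
        int byDate = a.date.compareTo(b.date);
        if (byDate != 0) {
            return byDate;
        }
        return Float.compare(b.score, a.score);
    };

    public static void main(String[] args) {
        List<DatedHit> hits = new ArrayList<>(Arrays.asList(
            new DatedHit("20060930", 0.2f),
            new DatedHit("20060929", 0.1f),
            new DatedHit("20060930", 0.9f)));
        hits.sort(ORDER);
        for (DatedHit h : hits) {
            System.out.println(h.date + " " + h.score);
        }
    }
}
```

Swapping a and b in the date comparison would give newest-first order instead. The finer the date resolution, the fewer ties there are, so relevancy only matters if dates are stored at a coarse granularity such as the day.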
