A couple of things...

1> You're probably already aware that the IndexReader doesn't reflect updates until it is re-opened, so any filters you cached would stay valid until you re-opened the reader. CachingWrapperFilter will cache the filters for you, keyed on the reader (sketch in the P.S. below). But this probably isn't germane to your problem.
2> If you are closing and re-opening your readers, that's a big performance hit. Is it possible that that's what you're seeing? (I doubt it, but I thought I'd bring it up.)

3> I doubt it will really make a performance difference, but you could use TermDocs.seek rather than getting a new TermDocs from the reader for each term (see the P.P.S. for a sketch). And if this *does* make a difference, please let me know.

4> It's playing with fire, but.... you say "in essence, we want persistent Lucene document numbers". I believe they *are* persistent unless and until you optimize *after* deleting documents, so you control when they change (you'll find more by searching the mail archive, though what to search for escapes my poor memory). So it *may* be possible to, say, optimize your index (and record the user-id/Lucene-id pairs) at discrete points in time, and/or synchronize this correspondence when convenient - perhaps in another index or in orthogonal documents.

5> Is there any chance whatsoever of inverting your problem? That is, make the database use the Lucene IDs as the primary key (assuming you can control when the Lucene IDs change, as above)? This is out there on the fringes of possibility and I'd be really surprised if you could.... but you're desperate <G>. You'd essentially have to rebuild your database whenever you re-optimized your index - a bit of the tail wagging the dog here.

6> Can you post-filter instead of pre-filter your queries? Essentially, when you get your search results back, ask "is this user in my set of users?". Whether that's feasible depends on whether you need the whole result set or just the top N documents, as well as on the result-set size: if the unfiltered result set is very large, probably not. I'm assuming there must be other clauses you attach the filter to.

7> If you can't re-organize your database, can you invert the problem by maintaining a table in the database that tracks this correspondence as the Lucene index changes, and use a query on *that* to populate your filter?

You're right that this part of your application is using Lucene for something it wasn't intended for. You really run into trouble when you try to use Lucene like an RDBMS; maybe the correct thing to do is to move the RDBMS-like actions into one of those....

Anyway, that exhausts my creativity this evening. And Mark's right: people way more knowledgeable than me will be on the list Monday.

Best of luck!
Erick
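P.S. Re 1>: a minimal sketch of what I mean, assuming your IdQueryFilter from below; the wrapper class and method names here are made up:

    import java.io.IOException;
    import java.util.Collection;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    public class CachedUserFilter {
        // Built once and reused. CachingWrapperFilter keeps the computed
        // BitSet in a map keyed on the IndexReader, so the expensive
        // bits() walk only runs the first time a given reader is searched.
        private final Filter cachedFilter;

        public CachedUserFilter(Collection users) {
            this.cachedFilter = new CachingWrapperFilter(new IdQueryFilter(users));
        }

        public Hits search(Searcher searcher, Query query) throws IOException {
            return searcher.search(query, cachedFilter);
        }
    }

Note the cache is keyed on the reader, so it only helps while you keep the same reader open - which circles back to 1>.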
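P.P.S. Re 3>: the seek() variant I was thinking of - the same logic as your bits() method, just reusing a single TermDocs. Untested, the class and parameter names are invented, and I'm assuming the collection holds the stringified database ids:

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.Collection;
    import java.util.Iterator;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class SeekingFilterSketch {
        public static BitSet bits(IndexReader reader, Collection userIds)
                throws IOException {
            BitSet result = new BitSet(reader.maxDoc());
            // One TermDocs for the whole loop, re-positioned with seek(),
            // instead of asking the reader for a fresh one per term.
            TermDocs termDocs = reader.termDocs();
            try {
                for (Iterator it = userIds.iterator(); it.hasNext();) {
                    termDocs.seek(new Term("id", it.next().toString()));
                    if (termDocs.next()) {
                        result.set(termDocs.doc());
                    }
                }
            } finally {
                termDocs.close();
            }
            return result;
        }
    }

The only real work saved is the per-term TermDocs allocation, which is why I doubt it buys you much.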
On 1/13/07, Kay Roepke <[EMAIL PROTECTED]> wrote:

Hi Erick!

On 13. Jan 2007, at 19:54 , Erick Erickson wrote:

> Before going off into modifying things, could you expand a bit on
> how you query to build up the filter? Perhaps providing a code
> snippet?

We are passing in our unique ids from our database, which we have to translate to Lucene document ids. This is done by an API (our own API) call, because the main application isn't written in Java. Lucene will function as a remote service for the other application servers.

> Just to be sure we're talking about the same thing, when you say
> filter, are you talking about Lucene filters? I'm assuming you are,
> in which case there is probably wisdom on the list (although I won't
> provide very much <G>). Building up a Lucene filter with
> TermEnum/TermDocs has been quite fast in my experience, but I don't
> know if my experience has any relevance to your situation....

Yes, I was talking about Lucene filters. Here's what we do currently (pretty much standard, if I'm correct):

    import java.io.IOException;
    import java.util.BitSet;
    import java.util.Collection;
    import java.util.Iterator;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.search.Filter;

    public class IdQueryFilter extends Filter {
        Collection users;

        public IdQueryFilter(Collection users) {
            this.users = users;
        }

        public BitSet bits(IndexReader index) throws IOException {
            BitSet result = new BitSet();
            Iterator it = users.iterator();
            while (it.hasNext()) {
                // User is our own domain class carrying the database id;
                // look up the Lucene doc whose "id" field holds that id
                Term term = new Term("id", Long.toString(((User) it.next()).id));
                TermDocs termDocs = index.termDocs(term);
                if (termDocs.next()) {
                    result.set(termDocs.doc());
                }
                termDocs.close();
            }
            return result;
        }
    }

This can take up to 30 sec for a large (~500,000 elements) collection of users, and it is the thing I'm currently trying to solve. I can handle situations where this takes long once, since I'm really asking something that Lucene isn't designed for, but the culprit is that I can't really cache the resulting BitSet. I can cache it on one of the Lucene servers, but I can't share it among the rest of the servers (we will eventually have way more than one, for scalability/reliability reasons). We cannot afford to calculate these bitsets on all servers (think of a repeated search, or paging, when you cannot make sure that you will hit the same Lucene application to do the search - you might end up on a different server that hasn't seen the request before).

I hope this makes it clearer what I'm up against. I'm not running around to change things for change's sake. If I can get around it, fine. If not, I can deal :)

Thanks,
Kay

-- 
Kay Röpke
http://classdump.org/
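P.S. In case it helps, the search side looks roughly like this (heavily simplified sketch; the index path and the query here are placeholders, the real query comes in through our API):

    import java.util.Collection;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class SearchWithUserFilter {
        public static Hits search(Collection users) throws Exception {
            // open the shared index and run the actual query, restricted
            // to the documents whose "id" field matched one of the users
            IndexReader reader = IndexReader.open("/path/to/index");
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new TermQuery(new Term("body", "lucene")); // placeholder
            return searcher.search(query, new IdQueryFilter(users));
        }
    }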