I just love it when I get so wrapped up in a particular approach that
alternatives don't occur to me. So I wondered what would happen if I just
got stupid simple and tried solving what I think is your problem without
involving Lucene.

So, I wrote a little program to fill up a HashMap with <Integer, Integer>
pairs, with the key being a random number and the value an integer starting
with 0. Really, mimicking a map of your <dbid, lucdocid> pairs. Then fill up
a bitset by looking up a bunch of dbids and setting the corresponding
lucdocid in the bitset.

Map size, 10,000,000 <userid, lucdocid> pairs
Looking up 1,000,000 user ids and setting them in a bitset.

Total time to set all the bits: 1.016 seconds, running inside Eclipse on
a 2700 MHz AMD with 1 GB of memory (I used up almost all of that memory, but
made no attempt to optimize it at all).

Yes, that's close enough to one second not to matter.
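
For anyone who wants to try this themselves, here is a minimal sketch of the
kind of test described above. It is a reconstruction, not the program I
actually ran: the class name BitSetLookupSketch and the helper names buildMap
and setBits are made up for illustration, and the default sizes are scaled
down (1,000,000 pairs, 100,000 lookups) on the assumption that you may not
want to give the JVM enough heap for the full 10,000,000 pairs. Pass larger
numbers on the command line to get closer to the figures above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical reconstruction of the HashMap + BitSet timing test.
class BitSetLookupSketch {

    // Build a map of `size` distinct random keys (stand-ins for dbids) to
    // sequential values 0..size-1 (stand-ins for Lucene doc ids).
    public static Map<Integer, Integer> buildMap(int size, Random rnd) {
        // Presize so the map does not rehash while it is being filled.
        Map<Integer, Integer> map = new HashMap<>((int) (size / 0.75f) + 1);
        int docId = 0;
        while (map.size() < size) {
            Integer key = rnd.nextInt();
            if (!map.containsKey(key)) {
                map.put(key, docId++);
            }
        }
        return map;
    }

    // Look up each id and set the bit for the doc id it maps to.
    // Ids that are not in the map are skipped.
    public static BitSet setBits(Map<Integer, Integer> map, int[] ids, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int id : ids) {
            Integer docId = map.get(id);
            if (docId != null) {
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // Assumed defaults, smaller than the sizes quoted in the message.
        int mapSize = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
        int lookups = args.length > 1 ? Integer.parseInt(args[1]) : 100_000;
        Random rnd = new Random();

        Map<Integer, Integer> map = buildMap(mapSize, rnd);

        // Pick the ids to look up from keys that really are in the map.
        Integer[] keys = map.keySet().toArray(new Integer[0]);
        int[] ids = new int[lookups];
        for (int i = 0; i < lookups; i++) {
            ids[i] = keys[rnd.nextInt(keys.length)];
        }

        // Time only the lookup-and-set loop, not the map construction.
        long start = System.nanoTime();
        BitSet bits = setBits(map, ids, mapSize);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Set " + bits.cardinality() + " bits from "
                + lookups + " lookups in " + elapsedMs + " ms");
    }
}
```

Only the lookup loop is timed here, since building the map is the part that
would move into a warm-up step. Timings will of course differ from machine to
machine and JVM to JVM.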

I started by wondering what would happen if I used a RAMDir to map the
<userid, lucdocid> pairs, thinking you could generate that RAMDir during
warmup, but wanted to get a baseline for the bitset part before dealing with
Lucene. But it *is* Sunday, and this is *not* my problem, so after I got
this number I decided to leave the rest of it as an "exercise for the reader
<G>". But we're having rain/sleet combinations here in SE Michigan, so what
the heck....

I wonder, if this approach doesn't work for you, what would happen if you
built a RAMDir with this mapping (which keeps your issues with updating
under control). If that uses too much memory, I also wonder what would
happen if you built an FSDirectory index with these pairs as part of warm-up.

Just creating the map takes considerable time in my test program, so you
probably want to consider some kind of warm-up process....

Best of luck!
Erick

On 1/14/07, Kay Roepke < [EMAIL PROTECTED]> wrote:


On 14. Jan 2007, at 3:54 , Erick Erickson wrote:

> 3> I doubt it really will make a performance difference, but you
> could use
> TermDocs.seek rather than get a new termdocs for each term from the
> reader.
> (and if this *does* make a difference, please let me know)

It seems it does. I have just changed it to use seek, and the time
went from 40 sec to a little over 29 secs. Still too slow, but it's
the right direction :)

cheers,
-k
--
Kay Röpke
http://classdump.org/




