David Spencer wrote:

Anson Lau wrote:

Hi All,

Has anyone seen the project MG4J (Managing Gigabytes for Java)
http://mg4j.dsi.unimi.it/ ?  Does anybody know enough about both Lucene
and MG4J to comment on how the two compare?


I've wondered if Lucene does comparable (key/index) compression to what the related book (Managing Gigabytes, excellent BTW) describes...

Just a question: my personal experience with a commercial engine I partly developed is that the "continuation bit" (a.k.a. the AltaVista solution) is a good and efficient solution compared with the gamma code, delta code, and other codes used for variable-length integer representation (see MG).


Given an int n, the continuation-bit scheme treats each byte as 7 bits of payload plus 1 bit that says whether the next byte is also used to represent n.
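To make the scheme concrete, here is a minimal sketch in Python. The function names are mine, and the byte order (low-order 7 bits first, high bit set when another byte follows) is an assumption; the description above does not fix either detail, and other engines may order the bytes differently.

```python
def encode_varint(n):
    """Encode a non-negative int using 7 payload bits per byte.

    Assumed convention: low-order bits come first, and the high bit
    of a byte is set when at least one more byte follows.
    """
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # 7 payload bits + continuation bit
        n >>= 7
    out.append(n)  # last byte: continuation bit is clear
    return bytes(out)


def decode_varint(buf, pos=0):
    """Decode one int starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b & 0x80 == 0:  # continuation bit clear: this was the last byte
            return value, pos
        shift += 7
```

Values below 128 take one byte, values below 16384 take two, and so on. Decoding only needs a mask, a shift, and a byte-aligned read per step.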

On average you will lose some bits on small gaps between contiguous integers in the posting list, but not that many, since on large collections gaps are large. In return, you can operate on machine-oriented word lengths instead of bit-level operations, which are much more expensive.
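A rough way to see the size trade-off is to count bytes for continuation-bit coding against bits for the Elias gamma code on the same gaps. The sketch below is only an illustration: the helper names and the two sample posting lists are made up, document IDs are assumed to start at 1 and be strictly increasing, and the gamma cost uses the standard length of 2*floor(log2(x)) + 1 bits.

```python
def vbyte(n):
    """Continuation-bit encoding: 7 payload bits per byte, low bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def to_gaps(doc_ids):
    """Turn strictly increasing doc IDs (starting at 1 or more) into gaps."""
    gaps = []
    prev = 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def vbyte_size_bits(doc_ids):
    """Total size in bits when each gap is continuation-bit encoded."""
    return 8 * sum(len(vbyte(g)) for g in to_gaps(doc_ids))


def gamma_size_bits(doc_ids):
    """Total size in bits if each gap were Elias-gamma coded instead."""
    return sum(2 * (g.bit_length() - 1) + 1 for g in to_gaps(doc_ids))


dense = [1, 2, 4, 9]            # hypothetical list with small gaps
sparse = [3, 7, 150, 100000]    # hypothetical list with larger gaps

print(vbyte_size_bits(dense), gamma_size_bits(dense))
print(vbyte_size_bits(sparse), gamma_size_bits(sparse))
```

On the dense list the byte-aligned code spends a full byte on every gap while gamma needs only a few bits each. On the sparse list the two come out much closer. Neither figure says anything about decode speed, which is the other half of the trade-off described above.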

I saw a small increase in index size, but a big saving in query time. Any similar or opposite experience?

--
"We have no credible evidence that Iraq and Al Qaeda cooperated on attacks against the United States."
Staff report of the commission investigating the Sept. 11 attacks.


