I didn't realise document-length precision was that unimportant for ranking. What does Google do? If they pull 1 byte per document into memory then, at least going by their claimed number of indexed documents, that's over 3GB. I can't see them equipping their 10,000 Linux machines with more than 3GB of memory each.
Apologies if this is off-topic for this list.

Cheers,
Jonathan

On Wednesday 15 January 2003 04:21, Doug Cutting wrote:
> Jonathan Baxter wrote:
> > How important is it for I/O performance that Lucene uses only one
> > byte to represent document length? Or are there reasons other
> > than performance for using so few bits?
>
> To achieve good search performance, field-length normalization
> factors must be memory-resident. So not only must the entire
> contents of these files be read when searching, they must also be
> kept in memory. With the one-byte encoding this means that Lucene
> requires a byte per indexed field per document. So a 10M-document
> collection with five fields requires 50MB of memory to be searched.
> Doubling this to two bytes would double the memory requirement.
> Is that acceptable? It depends on who you ask.
>
> Why do you find this insufficient? The one-byte float format (used
> in the current, unreleased sources) can actually represent a large
> range of values. Its precision is low, but high precision isn't
> usually required for length normalization or Google-style boosting.
>
> Are you trying to use this for some other purpose in your ranking?
>
> Doug
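To illustrate how a single byte can cover a large range of values at low precision, here is a sketch of an 8-bit float with 5 exponent bits and 3 mantissa bits. The class name, bit layout, and bias below are my own assumptions for illustration; the exact encoding in Lucene's sources may differ.

```java
// Illustrative one-byte float: (exponent << 3) | mantissa, byte 0 reserved
// for the value zero. NOT the exact layout used by Lucene; a sketch only.
public class SmallFloatSketch {

    static final int BIAS = 15; // hypothetical exponent bias

    // Encode a non-negative float into one byte.
    static byte encode(float f) {
        if (f <= 0f) return 0;                        // zero (and below) maps to byte 0
        int e = (int) Math.floor(Math.log(f) / Math.log(2));
        int m = Math.round((f / (float) Math.pow(2, e) - 1f) * 8f);
        if (m == 8) { m = 0; e++; }                   // rounding carried into the exponent
        e += BIAS;
        if (e > 31) { e = 31; m = 7; }                // clamp to largest representable value
        if (e < 1)  { e = 1;  m = 0; }                // clamp to smallest positive value
        return (byte) ((e << 3) | m);
    }

    // Decode: value = (1 + mantissa/8) * 2^(exponent - BIAS).
    static float decode(byte b) {
        if (b == 0) return 0f;
        int e = (b & 0xff) >>> 3;
        int m = b & 0x7;
        return (1f + m / 8f) * (float) Math.pow(2, e - BIAS);
    }

    public static void main(String[] args) {
        for (float f : new float[] {0.001f, 0.1f, 1f, 3f, 250f}) {
            System.out.println(f + " -> " + decode(encode(f)));
        }
    }
}
```

With these parameters the representable positive values span roughly 2^-14 up to about 1.2e5, with relative error bounded by half a mantissa step (about 6%) — wide range, low precision, which is the trade-off Doug describes as adequate for length normalization and boosting.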