I have always thought that norms should be an interface rather than fixed, as there are many uses of Lucene where norms are not necessary, and the memory overhead is substantial.
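Purely as a sketch of what I mean (none of these names exist in Lucene today; they are made up for illustration), the per-field norms could hide behind something like:

    // Sketch only -- illustrative names, not existing Lucene API.
    // The idea: readers obtain norms from a pluggable source, so an
    // implementation can back them with a byte[], a bit set, or skip
    // them entirely for fields that do not need length normalization.
    public interface Norms {

      /** Norm byte for this field in the given document. */
      byte norm(int doc);

      /** One byte of state regardless of how many documents exist. */
      class Constant implements Norms {
        private final byte value;
        public Constant(byte value) { this.value = value; }
        public byte norm(int doc) { return value; }
      }
    }

A reader for a field that does not need norms could then hand back a Constant (or nothing at all) instead of a one-byte-per-document array.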
-----Original Message-----
From: Arvind Srinivasan [mailto:[EMAIL PROTECTED]
Sent: Sunday, May 22, 2005 7:05 PM
To: java-dev@lucene.apache.org
Subject: One Byte is Seven bits too many? - A Design suggestion

Hi,

The norm takes up 1 byte of storage per document per field. While this may seem very small, a simple calculation shows that the IndexSearcher can consume a lot of memory when it caches the norms. Further, the current implementation loads the norms into memory as soon as the segments are loaded. Here are the calculations:

For medium-sized archives:
  docs=40 million, fields=2  => 80MB memory
  docs=40 million, fields=10 => 400MB memory
  docs=40 million, fields=20 => 800MB memory

For larger archives:
  docs=400 million, fields=2  => 800MB memory
  docs=400 million, fields=10 => ~4GB memory
  docs=400 million, fields=20 => ~8GB memory

To further compound the issue, we have found that JVM performance drops as the amount of memory it manages increases. While the storage itself may not be a concern, the runtime memory requirement could use some optimization, especially for large numbers of fields.

Fields themselves fall into one of three categories:
(a) Tokenized fields with huge variance in the number of tokens, e.g. an HTML page, a mail body, etc.
(b) Tokenized fields with very little variance in the number of tokens, e.g. an HTML page title, a mail subject, etc.
(c) Fixed tokenized fields, e.g. department, city, state, etc.

The one-byte norm is very applicable to (a) but not to (b) or (c), and in typical usage growth in the number of fields can be attributed to (b) and (c).

Two solutions come to mind:

(1) If there is forethought in the field design, one can prefix the field tokens and thereby reduce the number of fields. Of course, this adds the overhead of an embarrassing explanation to every query writer of why a prefix must be added to every token. If the prefixing can be done underneath, it may work, but it is still not elegant.

(2) The norm for (c) takes only two values: 0 when the field is not present, and a fixed value otherwise. In this scenario the memory requirement is only one bit per document per field. I would argue that even for (b) one can approximate the norm with one bit and not lose much in the ranking of documents.

Several implementation options are possible:
(a) Apply the approximation at the time the index is written (backward compatibility has to be considered).
(b) Use a bit set instead of a byte array for search purposes.

I have been wanting to do this for the last 6 months, but have not found the time yet. If I do, I will submit an implementation. Also, if a field is mandatory, the 0 case never occurs, and in that situation a single constant can represent the whole array. Maybe one byte is 8 bits too many :-))

Arvind.
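For what it's worth, here is a rough sketch of option (2) combined with implementation option (b), using java.util.BitSet; the class and method names are made up for the example and are not proposed API:

    import java.util.BitSet;

    // Sketch of the one-bit-per-document idea for fixed-token fields.
    // For a 40 million document index this needs roughly 5MB per field
    // (40M bits) instead of 40MB (40M one-byte norms).
    public class BitNorms {

      private final BitSet present;  // bit is set when the doc has the field
      private final byte fixedNorm;  // the single non-zero norm value

      public BitNorms(int maxDoc, byte fixedNorm) {
        this.present = new BitSet(maxDoc);
        this.fixedNorm = fixedNorm;
      }

      /** Marked for each document that actually contains the field. */
      public void setPresent(int doc) {
        present.set(doc);
      }

      /** Returns 0 when the field is absent, otherwise the fixed norm byte. */
      public byte norm(int doc) {
        return present.get(doc) ? fixedNorm : (byte) 0;
      }
    }

At 40 million documents that is about 5MB per field for the bit set plus a single byte for the fixed norm, versus 40MB for the current byte array; and, as the original message notes, for a mandatory field even the bit set can be dropped and a single constant used.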