Hi - One thing to consider is field norms. If your fields aren't analyzed, this doesn't apply to you.
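(For reference, a minimal sketch of which indexing options produce norms, assuming the Lucene 2.9/3.0 Field API; the class, field names and values here are made up for illustration, not from the original thread. Analyzed fields carry norms unless you pick one of the *_NO_NORMS options.)

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NormsExample {
    public static Document makeDoc() {
        Document doc = new Document();

        // Tokenized with norms: contributes one norm byte per document for "title".
        doc.add(new Field("title", "some title text",
                          Field.Store.YES, Field.Index.ANALYZED));

        // Tokenized, norms omitted: no norm bytes are written for "body".
        doc.add(new Field("body", "longer body text",
                          Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));

        // Untokenized identifier, norms omitted as well.
        doc.add(new Field("id", "doc-42",
                          Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

        return doc;
    }

    public static void main(String[] args) {
        System.out.println(makeDoc());
    }
}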
But if you do have norms, I believe it's one byte per field-with-norms x number of documents. It doesn't matter whether the field actually occurs in a given document or not: the norms file is nTotalFields x nDocs bytes (plus a few header bytes).

So, an index with 10,000 documents, one field per doc, the same field for all docs:

-rw-r--r--  1 tom  wheel    10004 Dec 30 18:54 _0.nrm

A 10,000-doc index where each doc has one of 100 different field names, but still only one field per doc:

-rw-r--r--  1 tom  wheel  1000004 Dec 30 18:55 _0.nrm

As you can see, the total space for norms goes up linearly with the number of fields.

Tom

On Wed, Dec 30, 2009 at 10:19 AM, Renaud Delbru <renaud.del...@deri.org> wrote:

> Hi,
>
> Just sharing some personal experience in this domain.
>
> We ran some benchmarks in a similar setup (indexing millions of documents
> with thousands of fields) to measure the impact of a large number of
> fields on a Lucene index.
> We observed that the more fields you have, the larger the term dictionary
> becomes. Lucene builds its term dictionary from (field name, term) pairs,
> effectively concatenating the field name with each term of that field. In
> the worst case (when every term occurs in every field), you can have
> M fields * N terms entries in the dictionary.
> As a consequence, a term lookup in the dictionary takes longer. This can
> have a significant impact when the cache is cold. When the cache is warm
> (when parts of the dictionary are in memory), the time overhead is not
> significant, or even nil.
>
> If, in your data collection, the majority of terms are local to a single
> field (i.e. a term occurs in only one field, like for example the
> timestamp of a document), your dictionary will not grow that much and you
> will probably not notice slower dictionary lookups.
>
> And, as Erick explained previously, it also depends on the size of your
> index and on your use case. If the index is relatively small, the overhead
> will be imperceptible. However, if your index is large (millions of
> documents), you will probably notice the overhead the first time a query
> is executed (cold cache).
>
> I haven't tested with Lucene 3.0, but I have read somewhere (correct me if
> I am wrong) that this new version includes some optimisations for
> dictionary lookups, which should minimize the overhead.
> --
> Renaud Delbru
>
> On 30/12/09 16:18, Jason Tesser wrote:
>>
>> I have a situation where I might have 1000 different types of Lucene
>> Documents, each with 10 or so fields with different names that get
>> indexed.
>>
>> I am wondering if this is bad to do within Lucene. I end up with 10,000
>> fields within the index, although any given document has only 10 or so.
>>
>> I was hoping not to have many indexes under the covers if I can avoid
>> it, but I don't want performance to suffer either.
>>
>> Any thoughts?
>>
>> Thanks,
>> Jason Tesser
>> dotCMS Lead Development Manager
>> 1-305-858-1422
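To make Renaud's dictionary point concrete, here is a minimal sketch that walks the term dictionary and counts entries per field, assuming a pre-4.0 (2.9/3.0) Lucene index and API; the class name is made up and the index path is a hypothetical command-line argument:

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TermDictionaryStats {
    public static void main(String[] args) throws Exception {
        // args[0] is the index directory, e.g. /path/to/index.
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        TermEnum terms = reader.terms();
        Map<String, Integer> perField = new HashMap<String, Integer>();
        long total = 0;
        while (terms.next()) {
            // Each dictionary entry is a (field, text) pair, so every extra
            // field contributes its own entries to the dictionary.
            Term t = terms.term();
            Integer n = perField.get(t.field());
            perField.put(t.field(), n == null ? 1 : n + 1);
            total++;
        }
        terms.close();
        reader.close();
        System.out.println("total dictionary entries: " + total);
        for (Map.Entry<String, Integer> e : perField.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}

On an index like the one Jason describes, the per-field counts sum to the total dictionary size, which in the worst case grows toward M fields * N terms.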