Hi, I would like to propose and get feedback on a potential indexing performance improvement for the case where the compound file format is used (this is the default).
In compound segment mode, each merge operation ends by writing a compound file. To be more precise, the merge result is first written to the directory as a non-compound segment (many files), and then it is 'converted' into a single compound segment file. This conversion involves reading the entire non-compound segment and writing it again as a compound file. This means that compound-mode indexing does twice as much index writing as non-compound mode (and there's also the extra reading of the non-compound segment).

The reason for this two-step process when writing compound segment files is that the per-segment files cannot be written sequentially, one by one - several files are created together and written interleaved. But I think there is an intermediate state between one-compound-segment-file and non-compound-many-files. To my understanding, at merge time the following apply:
- .fnm (field infos) - independent of other files.
- .fdx .fdt (stored fields) - interleaved with each other, independent of other files.
- .tis .tii .frq .prx (dictionary and postings) - interleaved with each other, independent of other files.
- .tvx .tvd .tvf (term vectors) - interleaved with each other, independent of other files.
- .fN (norms) - all these files written sequentially, independent of other files.

Therefore, a "semi compound" segment file could be defined, made of 4 files (instead of 1):
- File 0: .fdx .tis .tvx
- File 1: .fdt .tii .tvd
- File 2: .frq .tvf
- File 3: .fnm .prx .fN

A merge should be able to write this segment representation in one pass - no need to read and write it again.

A few questions:
(1) Is this correct at all, or have I overlooked something?
(2) What performance gain would that buy?
(3) Is it reasonable to have 4 files per segment compared to 1 file per segment?

For (2), the indexing performance of non-compound mode is an upper bound. I compared indexing speeds of compound vs. non-compound, using the Reuters input set.
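To make the proposed grouping above concrete, here is a small sketch (the class and group names are illustrative only, not an actual Lucene API) showing how each per-segment extension would map to one of the four "semi compound" files. The key property is that every extension belongs to exactly one group, so a merge could stream each logical file straight to its group's output with no second read/write pass:

```java
import java.util.*;

public class SemiCompoundGrouping {
    // Hypothetical assignment of segment file extensions to the four
    // proposed "semi compound" files. Extensions that are written
    // interleaved with each other (e.g. .fdx and .fdt) land in
    // different physical files, so each file can be appended purely
    // sequentially during the merge.
    static final Map<String, List<String>> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put("file0", Arrays.asList("fdx", "tis", "tvx"));
        GROUPS.put("file1", Arrays.asList("fdt", "tii", "tvd"));
        GROUPS.put("file2", Arrays.asList("frq", "tvf"));
        GROUPS.put("file3", Arrays.asList("fnm", "prx", "fN"));
    }

    // Look up the one group an extension belongs to.
    static String groupFor(String ext) {
        for (Map.Entry<String, List<String>> e : GROUPS.entrySet())
            if (e.getValue().contains(ext)) return e.getKey();
        throw new IllegalArgumentException("unknown extension: " + ext);
    }

    public static void main(String[] args) {
        // The interleaved stored-fields pair goes to separate files:
        System.out.println(groupFor("fdx")); // file0
        System.out.println(groupFor("fdt")); // file1
        System.out.println(GROUPS.size());   // 4
    }
}
```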
Tried with stored+vectors, and without stored fields:

  round vect  stor  cmpnd runCnt recsPerRun rec/s  elapsedSec
  0     true  true  true    1     21578     150.2    143.69
  1     true  true  false   1     21578     178.9    120.58
  2     false false true    1     21578     164.7    131.03
  3     false false false   1     21578     184.3    117.07

This is a 19% speed-up with stored+vectors, and a 12% speed-up with no stored fields. As a side comment, it says something about I/O vs. CPU in Lucene indexing that cutting (I think) half of the file output speeds things up by less than 20%. But anyhow, this is not a negligible difference, and for really large indexes and busy systems, when the just-written non-compound segment is not in the system caches, it might have more effect. Possibly, search performance during indexing would also improve due to the reduced indexing I/O. Also, the delay for an addDocument() call that triggers a merge should become smaller.

Thanks for your comments, also (but not only) on (1) and (3) above.

Doron