Dear all I am happy to send my first email to Lucene community after some time standing aside, following many interesting discussions.
As part of my school project, I am intending to make some improvements in Lucene source code, and I need some advices on how significance my modification work would be. What I am interested so far is the gap encoding scheme in Lucene which is used in DocumentWriter.writePostings() to record the gap positions of a term within a document. The writePostings(), in turn, calls the writeVInt() method to record the gap, which is the byte-aligned coding scheme. I'm thinking of replace the byte-aligned scheme with the "fixed binary" coding scheme mentioned in the paper "Index compression using Fixed Binary Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found here http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html<http://www.cs.mu.oz.au/%7Evo/abstracts/am04:adc.html>). This scheme basically breaks the list of gaps into segments whose gaps (in one segment) will be coded in a fixed data width w (bits). The number of gaps in each segment is recored in a span variable s, and the pair (w,s) form a selector assigned for that segment. By effectively decompose the list, reduce the number of selectors into 16 combinations of relative data size (vs. previous segment) and span, and use greedy algorithm to find suboptimal solutions, the authors claimed that they could achieve better compression effectiveness (measured in bits per pointer averaged across the wholde index), and retrieval time compared to Golomb, interpolative, byte-aligned, and word aligned code schemes. What I wonder at this time is that in the case of Lucene, how possible it is to implement the "fixed binary" scheme that could enhance the performance, and whether there are other parts which I could also consider replacing the gap-encoding scheme. As I've started playing around with Lucene recently, I hope to have your helps to understand Lucene better ^_^ PS: for this type of discussion, which mailing list is most appropriate for my emai? Best regards, Luong Minh Thang