Dear all,

I am happy to send my first email to the Lucene community after spending
some time on the sidelines, following many interesting discussions.

As part of my school project, I intend to make some improvements to the
Lucene source code, and I need some advice on how significant my
modifications would be. What I am interested in so far is the gap encoding
scheme in Lucene, which is used in DocumentWriter.writePostings() to record
the gaps between the positions of a term within a document.
writePostings(), in turn, calls the writeVInt() method to record each gap,
which is a byte-aligned coding scheme.
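
To make the byte-aligned scheme concrete, here is a minimal sketch of how I
understand variable-length integer coding of gaps: 7 data bits per byte,
low-order group first, with the high bit set when more bytes follow. The
class VIntGapSketch and its methods are hypothetical names for illustration,
not Lucene's own code.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of byte-aligned gap coding; not Lucene's own class. */
final class VIntGapSketch {

    /** Turns ascending positions into gaps; the first gap is the position itself. */
    static int[] toGaps(int[] positions) {
        int[] gaps = new int[positions.length];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            gaps[i] = positions[i] - previous;
            previous = positions[i];
        }
        return gaps;
    }

    /** Writes each non-negative gap as 7-bit groups; high bit means "more bytes follow". */
    static byte[] encode(int[] gaps) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int gap : gaps) {
            int v = gap;
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80); // low 7 bits plus continuation flag
                v >>>= 7;
            }
            out.write(v); // final byte, continuation flag clear
        }
        return out.toByteArray();
    }

    /** Reads back {@code count} gaps written by {@link #encode(int[])}. */
    static int[] decode(byte[] bytes, int count) {
        int[] gaps = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int b = bytes[pos++];
            int v = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                v |= (b & 0x7F) << shift;
            }
            gaps[i] = v;
        }
        return gaps;
    }
}
```

With this sketch, any gap below 128 costs one byte and larger gaps cost two
or more, so every gap costs at least 8 bits no matter how small it is.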

I'm thinking of replacing the byte-aligned scheme with the "fixed binary"
coding scheme described in the paper "Index compression using Fixed Binary
Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found at
http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html).
This scheme basically breaks the list of gaps into segments, and all gaps in
one segment are coded with a fixed data width w (in bits). The number of
gaps in each segment is recorded in a span variable s, and the pair (w,s)
forms the selector assigned to that segment. By effectively decomposing the
list, reducing the number of selectors to 16 combinations of relative data
size (vs. the previous segment) and span, and using a greedy algorithm to
find suboptimal solutions, the authors claim that they achieve better
compression effectiveness (measured in bits per pointer, averaged across the
whole index) and better retrieval time than the Golomb, interpolative,
byte-aligned, and word-aligned coding schemes.
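
To illustrate the idea of (w,s) selectors, here is a rough sketch that splits
a gap list into fixed-width segments. It is deliberately simplified and is
not the algorithm from the paper: it uses absolute widths rather than the 16
relative selectors, and the greedy rule (extend a segment while the next gap
fits the width of the segment's first gap) is my own assumption. The class
name, method names, and the maxSpan parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified sketch of fixed-width segmentation; not the paper's algorithm. */
final class FixedWidthSegmentSketch {

    /** Bits needed to store a gap in binary (at least 1). */
    static int bitsNeeded(int gap) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(gap));
    }

    /**
     * Naive greedy split (an assumption, not the authors' rule): each segment
     * takes the width of its first gap and grows while the next gap still fits
     * and the span stays within maxSpan (which must be at least 1).
     * Returns one {w, s} pair per segment.
     */
    static List<int[]> segment(int[] gaps, int maxSpan) {
        List<int[]> selectors = new ArrayList<>();
        int i = 0;
        while (i < gaps.length) {
            int width = bitsNeeded(gaps[i]);
            int span = 1;
            while (i + span < gaps.length && span < maxSpan
                    && bitsNeeded(gaps[i + span]) <= width) {
                span++;
            }
            selectors.add(new int[] {width, span});
            i += span;
        }
        return selectors;
    }

    /** Total bits for the gap values alone (w * s per segment), excluding selector overhead. */
    static int payloadBits(List<int[]> selectors) {
        int bits = 0;
        for (int[] ws : selectors) {
            bits += ws[0] * ws[1];
        }
        return bits;
    }
}
```

For the gaps 3, 2, 1, 290, 5, 4 with maxSpan = 4, this sketch produces the
segments (w=2, s=3) and (w=9, s=3), i.e. 33 payload bits before selector
overhead. It also shows why the choice of segmentation matters: the small
gaps 5 and 4 are stored in 9 bits each because they follow 290.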

What I wonder at this point is how feasible it would be to implement the
"fixed binary" scheme in Lucene in a way that improves performance, and
whether there are other parts of Lucene where I could also consider
replacing the gap-encoding scheme.

As I have only recently started playing around with Lucene, I hope to have
your help in understanding it better ^_^

PS: For this type of discussion, which mailing list is most appropriate for
my email?

Best regards,

Luong Minh Thang
