+1 to MyCoy's suggestion. To answer your most immediate questions:

- Lucene mostly loads metadata into memory at the time of opening a segment (the dvm, tmd, fdm, vem, nvm, and kdm files); other files are memory-mapped, and Lucene relies on the filesystem cache to make their data efficiently available. This allows Lucene to have a very small memory footprint for searching.
- Finite state machines are mostly used for suggesters and for the terms index (the tip file), which essentially stores in an FST all the prefixes that are shared by 25-48 terms.
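To make the memory-mapping point concrete, here is a minimal, self-contained sketch using only the JDK (this is not Lucene code; Lucene's MMapDirectory does the same thing with more machinery). The mapped buffer is backed by the OS page cache, so reads go through the filesystem cache rather than the JVM heap:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for a segment file such as a .tim or .doc file.
        Path file = Files.createTempFile("segment", ".tim");
        Files.write(file, "hello lucene".getBytes(StandardCharsets.UTF_8));

        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the file read-only; the OS page cache, not the heap, holds the data.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] bytes = new byte[(int) ch.size()];
            buf.get(bytes);
            System.out.println(new String(bytes, StandardCharsets.UTF_8));
        }
        Files.delete(file);
    }
}
```

This is why a "hot" Lucene index searches fast with a small heap: repeated reads of the mapped postings and terms files are served from the page cache.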
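And to illustrate why shared prefixes matter for the terms index: because terms are stored in sorted order, consecutive terms tend to share long prefixes, and the index only needs to record the bytes that differ. The sketch below is a hypothetical illustration of that prefix-sharing idea, not Lucene's actual FST implementation (see the org.apache.lucene.util.fst package for the real thing):

```java
import java.util.List;

public class PrefixSharing {
    // Length of the common prefix of two strings.
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        // Sorted terms, as they would appear in the terms dictionary.
        List<String> terms = List.of("search", "searcher", "searching", "segment");
        String prev = "";
        int rawBytes = 0, storedBytes = 0;
        for (String t : terms) {
            int shared = sharedPrefix(prev, t);
            rawBytes += t.length();
            storedBytes += t.length() - shared; // only the new suffix is stored
            System.out.println(t + " -> shared=" + shared
                    + ", stored suffix=" + t.substring(shared));
            prev = t;
        }
        System.out.println("raw=" + rawBytes + " stored=" + storedBytes);
    }
}
```

Even on four terms the stored bytes drop from 30 to 16; on a real index with millions of sorted terms the savings are what let the whole terms index fit comfortably in an FST.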
On Sun, Nov 6, 2022 at 2:12 AM MyCoy Z <mycoy.zh...@gmail.com> wrote:

> I just started learning the Lucene HNSW source code last month.
>
> I find the most effective way is to start with the test cases, set
> debugging breakpoints in the code you're interested in, and walk through
> the code.
>
> Regards,
> MyCoy
>
> On Fri, Nov 4, 2022 at 9:24 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> > Hello,
> > I have been working with Lucene and Solr for quite some time and have a
> > good understanding of a lot of the moving parts at the code level.
> > However, I wish to learn Lucene internals from the ground up and want to
> > familiarize myself with all the dirty details. I would like to know what
> > would be the best way to go about it.
> >
> > To kick things off, I have been thinking about picking up "Lucene in
> > Action", but have been hesitant (and possibly wrongly so) since it is
> > based on Lucene 3.0 and we have come a long way since then. An example of
> > the level of detail I wish to learn (among other things) would be which
> > parts of a segment (.tim, .tip, etc.) get loaded into memory at search
> > time, which parts use finite state machines and why, etc.
> >
> > I would really appreciate any thoughts/inputs on how I can go about this.
> > Thanks in advance!
> >
> > Regards,
> > Rahul

-- 
Adrien