Hi AsterixDB community, I am Tejesh, and I am currently working on supporting Top-k nearest neighbor queries for GSoC 2026. During my research on how R-tree memtables are converted to SSTables, I encountered conflicting information and would like to clarify which approach aligns with the AsterixDB architecture.
The two sources I found state the following: Source 1: When R-trees are converted from RAM to SSTables on disk, only the leaf nodes are extracted, and the internal index entries are discarded to save disk space. These leaf nodes are then sorted by Hilbert values and stored sequentially to minimize random I/O, effectively destroying the original R-tree structure and flattening it out. Whenever the R trees are accessed, disk reads them sequentially Source 2: When R-trees are converted, the index entries are preserved to reduce the search space, despite a slight increase in storage and random I/O. The R-tree structure remains intact, with index entries pointing to child node page IDs. These nodes are then grouped and sorted by Hilbert values to reduce MBR overlaps. Here, disk reading may not be sequential; it could jump between pages of Disk memory. Could you please clarify which of these descriptions accurately reflects how AsterixDB handles this conversion? I have studied and understood secondary indexing, B-trees, R-trees, LSM-ified R-trees, page handling, and developed a few high-level ideas to solve the problem mainly by using a global priority queue. Now I am navigating through the codebase to see how the logic and data flow of these data structures are actually implemented and getting myself familiar with the operators to actually implement my idea to work with Hyracks. I would greatly appreciate it if I could receive a reply. Best regards, Tejesh
