Hi everyone,

I am currently looking into *ASTERIXDB-3745* and would like to draft a PR
for this issue ahead of GSoC 2026. Since my proposal also focuses on vector
queries, working through the vctree-index training dataflow is a good
stepping stone.

*My understanding of the problem:* When a vector index (vctree-index) is
being trained, the system takes a sample of the records. Because AsterixDB
supports Open Types, this random sample might pick up records where the
vector field is missing, null, or has a different dimensionality than what
the index definition requires. Handing this heterogeneous sample to the
training algorithm causes the training to fail.

*My proposed approach:* I plan to implement a zero-copy validation filter
during the dataflow sampling phase. Before a tuple is accepted into the
reservoir, the logic will:

   1. Check the ATypeTag at fieldStartOffset to ensure the vector is not
      null or missing.
   2. Use getFieldLength() to verify that the byte size of the field
      exactly matches the size implied by the index's dimensionality
      (accounting for the type tag and array headers).

Any tuple failing these checks will simply be skipped by the sampler.
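
To make the two checks concrete, here is a minimal, self-contained sketch of
what I have in mind. It does not use the real AsterixDB classes: the tag
constants, the header size, the 4-byte element width, and the names
VectorSampleFilter, isValidVector, and Reservoir are all hypothetical
placeholders, and the length check only works if vector elements are
fixed-width. The validation itself reads the buffer in place; a copy is made
only when a tuple is actually stored in the reservoir.

```java
import java.util.Arrays;
import java.util.Random;

final class VectorSampleFilter {
    // Hypothetical tag bytes for illustration; real values would come from ATypeTag.
    public static final byte TAG_MISSING = 0;
    public static final byte TAG_NULL = 1;
    public static final byte TAG_ARRAY = 2;

    // Hypothetical layout: [1 tag byte][4-byte array header][dim * 4-byte elements].
    public static final int HEADER_BYTES = 4;
    public static final int ELEMENT_BYTES = 4;

    public static int expectedFieldLength(int dim) {
        return 1 + HEADER_BYTES + dim * ELEMENT_BYTES;
    }

    // Reads only the tag byte and the field length; no bytes are copied.
    public static boolean isValidVector(byte[] buf, int fieldStartOffset, int fieldLength, int dim) {
        if (fieldLength < 1) {
            return false;
        }
        // Check 1: rejects null, missing, and any other non-array type.
        if (buf[fieldStartOffset] != TAG_ARRAY) {
            return false;
        }
        // Check 2: the byte size must match the indexed dimensionality exactly.
        return fieldLength == expectedFieldLength(dim);
    }

    // Reservoir sampling (Algorithm R) that counts only valid tuples.
    public static final class Reservoir {
        private final byte[][] slots;
        private final Random rng;
        private long validSeen = 0;

        public Reservoir(int capacity, long seed) {
            this.slots = new byte[capacity][];
            this.rng = new Random(seed);
        }

        public void offer(byte[] buf, int fieldStartOffset, int fieldLength, int dim) {
            if (!isValidVector(buf, fieldStartOffset, fieldLength, dim)) {
                return; // skipped tuples do not advance the counter
            }
            validSeen++;
            if (validSeen <= slots.length) {
                slots[(int) (validSeen - 1)] = copy(buf, fieldStartOffset, fieldLength);
            } else {
                // Pick j uniformly in [0, validSeen); replace a slot only if j falls inside the reservoir.
                long j = (long) (rng.nextDouble() * validSeen);
                if (j < slots.length) {
                    slots[(int) j] = copy(buf, fieldStartOffset, fieldLength);
                }
            }
        }

        public int size() {
            return (int) Math.min(validSeen, slots.length);
        }

        private static byte[] copy(byte[] buf, int offset, int length) {
            return Arrays.copyOfRange(buf, offset, offset + length);
        }
    }
}
```

One design point in this sketch: rejected tuples are not counted in
validSeen, so the sample should stay uniform over the valid records rather
than over all records.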

*Where I am currently blocked:* I have my local environment built and have
been trying to trace the execution path. I searched for keywords like
train_list and ReservoirSample within the Hyracks and compiler layers, but
I couldn't pinpoint the file that implements the sampling. Could you please
point me to the right Java class or package to look at?

*(Note: I am reaching out via email because my public Jira account request
is still pending approval, so I cannot comment directly on the ticket yet)*

Thank you,

Tejesh


