Jean Louis <bugs@gnu.support> writes:

>> Do you parse Org files for chunking? What is the chunking strategy?
>
> Yes, I parse by headings. That may not be as best.
> ...
>          (headings (rcd-org-get-headings-with-contents))
> ...
>          (input (concat heading-text "\n" contents))
>          (embeddings (rcd-llm-get-embedding input nil "search_document: ")))
> ...
>          (contents (when (org-element-property :contents-begin hl)
>                      (buffer-substring-no-properties
>                       (org-element-property :contents-begin hl)
>                       (org-element-property :contents-end hl)))))

So, it seems that you are including the whole subtree under each heading
and then splitting the text into fixed-size chunks. AFAIU, that's not the
best strategy: you may cut the chunks abruptly in the middle of a heading
or a sentence. You may consider something like
https://python.langchain.com/docs/how_to/recursive_text_splitter/

Since you can work with the AST, it will be trivial to split things all
the way down to the paragraph level and then split the paragraphs by
sentences (if that is necessary). Using meaningful chunking tends to
improve vector search and LLM performance _a lot_.

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
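
As a rough illustration of the paragraph-then-sentence splitting suggested above, here is a minimal sketch. It is not the code from the quoted message. The function names `my/split-sentences` and `my/org-paragraph-chunks` and the MAX-CHARS threshold are hypothetical, and the sentence splitter is deliberately naive (it will also break after abbreviations such as "e.g.").

```elisp
;; Sketch only: function names and the MAX-CHARS threshold are
;; assumptions made for illustration.
(require 'org-element)
(require 'subr-x)

(defun my/split-sentences (text)
  "Naively split TEXT after `.', `!' or `?' followed by whitespace."
  (let ((start 0)
        (sentences '()))
    (while (string-match "[.!?][ \t\n]+" text start)
      ;; Keep the punctuation mark, drop the whitespace after it.
      (push (substring text start (1+ (match-beginning 0))) sentences)
      (setq start (match-end 0)))
    ;; Whatever remains after the last boundary is the final sentence.
    (when (< start (length text))
      (push (substring text start) sentences))
    (nreverse sentences)))

(defun my/org-paragraph-chunks (max-chars)
  "Return a list of text chunks for the current Org buffer.
Each paragraph becomes one chunk.  A paragraph longer than
MAX-CHARS is split further into sentences."
  (let ((chunks '()))
    (org-element-map (org-element-parse-buffer) 'paragraph
      (lambda (par)
        (let ((text (string-trim
                     (buffer-substring-no-properties
                      (org-element-property :begin par)
                      (org-element-property :end par)))))
          (if (<= (length text) max-chars)
              (push text chunks)
            (dolist (sentence (my/split-sentences text))
              (push sentence chunks))))))
    (nreverse chunks)))
```

Calling `(my/org-paragraph-chunks 1000)` in an Org buffer returns a list of strings that could each be passed to the embedding function in place of the whole subtree. Prepending the parent heading text to every chunk, as the quoted code does with `heading-text`, may help preserve context.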