Jean Louis <bugs@gnu.support> writes:

>> Do you parse Org files for chunking? What is the chunking strategy?
>
> Yes, I parse by headings. That may not be as best.
> ...
>          (headings (rcd-org-get-headings-with-contents))
> ...
>              (input (concat heading-text "\n" contents))
>              (embeddings (rcd-llm-get-embedding input nil "search_document: ")))
> ...
>                (contents (when (org-element-property :contents-begin hl)
>                          (buffer-substring-no-properties
>                             (org-element-property :contents-begin hl)
>                             (org-element-property :contents-end hl)))))

So, it seems that you are including the whole subtree under each heading
and then splitting the text into fixed-size chunks.

AFAIU, that's not the best strategy: fixed-size chunks may get cut
abruptly in the middle of a heading or a sentence. You may consider
something like
https://python.langchain.com/docs/how_to/recursive_text_splitter/
Since you can work with the AST, it is trivial to split things all the
way down to the paragraph level and then split the paragraphs into
sentences (if that is necessary).
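
A minimal, untested sketch of what I mean, assuming a chunk size limit
counted in characters. All the `my/' names and the 1000-character
default are hypothetical, and the sentence splitting is a naive regexp,
not a proper sentence parser. Each paragraph becomes one chunk; only
paragraphs longer than the limit are split further at sentence
boundaries.

```elisp
;;; -*- lexical-binding: t; -*-
(require 'org)
(require 'org-element)
(require 'subr-x)

(defun my/split-sentences (text)
  "Split TEXT after `.', `!' or `?' followed by whitespace (naive)."
  (let ((start 0) (sentences '()))
    (while (string-match "[.!?]+[ \t\n]+" text start)
      (push (substring text start (match-end 0)) sentences)
      (setq start (match-end 0)))
    ;; Whatever follows the last sentence boundary is the final sentence.
    (when (< start (length text))
      (push (substring text start) sentences))
    (mapcar #'string-trim (nreverse sentences))))

(defun my/pack-sentences (sentences max-chars)
  "Greedily join SENTENCES into strings of at most MAX-CHARS characters.
A single sentence longer than MAX-CHARS is kept whole, never cut."
  (let ((chunks '()) (current ""))
    (dolist (s sentences)
      (cond ((string-empty-p current) (setq current s))
            ((<= (+ (length current) 1 (length s)) max-chars)
             (setq current (concat current " " s)))
            (t (push current chunks)
               (setq current s))))
    (unless (string-empty-p current) (push current chunks))
    (nreverse chunks)))

(defun my/org-chunk-buffer (&optional max-chars)
  "Return a list of (HEADING-TITLE . CHUNK) for the current Org buffer.
Paragraphs are the basic unit; paragraphs longer than MAX-CHARS
\(hypothetical default: 1000) are split at sentence boundaries."
  (let ((max-chars (or max-chars 1000))
        (chunks '()))
    (org-element-map (org-element-parse-buffer) 'paragraph
      (lambda (par)
        (let* ((heading (org-element-lineage par '(headline)))
               ;; Paragraphs before the first heading get an empty title.
               (title (if heading
                          (org-element-property :raw-value heading)
                        ""))
               (text (string-trim
                      (buffer-substring-no-properties
                       (org-element-property :contents-begin par)
                       (org-element-property :contents-end par)))))
          (dolist (piece (if (<= (length text) max-chars)
                             (list text)
                           (my/pack-sentences
                            (my/split-sentences text) max-chars)))
            (push (cons title piece) chunks)))))
    (nreverse chunks)))
```

For example, in a temporary Org buffer:

```elisp
(with-temp-buffer
  (insert "* Notes\nFirst sentence. Second sentence.\n\nAnother paragraph.\n")
  (org-mode)
  (my/org-chunk-buffer 20))
```

Keeping the heading title next to every chunk lets you prepend it to the
text before computing the embedding, so each chunk still carries its
context.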

Using meaningful chunking tends to improve vector search and LLM
performance _a lot_.

-- 
Ihor Radchenko // yantar92,
Org mode maintainer,
Learn more about Org mode at <https://orgmode.org/>.
Support Org development at <https://liberapay.com/org-mode>,
or support my work at <https://liberapay.com/yantar92>
