Your approach sounds great as well, Nick.

> On Apr 9, 2024, at 2:21 AM, Michael Wechner <[email protected]> wrote:
>
> Thanks for sharing your approach!
>
> Do you already have some code to share?
>
> Today I read about https://github.com/infiniflow/ragflow which might also
> have some interesting chunking approaches.
>
> Thanks
>
> Michael
>
> On 09.04.24 at 01:25, Nick Burch wrote:
>> On Mon, 8 Apr 2024, Tim Allison wrote:
>>> Not sure we should jump on the bandwagon, but anything we can do to support
>>> smart chunking would benefit us.
>>>
>>> Could just be more integrations with parsers that turn out to be useful. I
>>> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
>>> https://github.com/Filimoa/open-parse
>>
>> I played around with chunking a bit late last year, but owing to not getting
>> any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can
>> say that most people are doing a terrible job in their out-of-the-box
>> configs...
>>
>> My current suggested (but not fully tested) approach is:
>>  * Define a range of chunk sizes that you'd like (min / ideal / max)
>>  * Parse as XHTML with Tika
>>  * Keep track of headings and table headers
>>  * Break on headings
>>  * If a chunk is too big, break on other elements (e.g. div or p)
>>  * If a chunk is too small, and near other small chunks, join them
>>  * Include 1-2 headings above the current one at the top,
>>    as a targeted bit of Table of Contents (e.g. chunk starts on H3, put
>>    the H2 in as well)
>>  * If you broke up a huge table, repeat the table headers at the
>>    start of every chunk
>>  * When you're done chunking + adding bits back at the top, convert
>>    to markdown on output
>>
>> Happy to explain more! But sadly lacking time right now to do much on that
>>
>> Nick
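
For anyone wanting something concrete to poke at, here is a rough, untested-in-anger sketch of how the steps in Nick's list might fit together. It is not Nick's code. It assumes the XHTML has already been produced by Tika and is passed in as a string (Tika itself is not called), it measures chunk size in characters, it ignores the "ideal" size, and it leaves out the table-header handling entirely. The names SectionParser, chunk_xhtml, min_size, max_size and context are all made up for illustration.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
BLOCKS = {"p", "div", "li"}  # assumed set of elements we may break on


class SectionParser(HTMLParser):
    """Collect (heading_path, [block_text, ...]) sections, breaking on headings."""

    def __init__(self):
        super().__init__()
        self.path = []          # open headings as (level, text), outermost first
        self.sections = []      # list of (heading_path, block_texts)
        self.buf = []
        self.in_heading = None  # level of the heading being read, if any

    def _flush(self):
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def end_block(self):
        """Store buffered text as one block of the current section."""
        if self.in_heading:
            return
        text = self._flush()
        if text:
            if not self.sections:
                self.sections.append(([], []))
            self.sections[-1][1].append(text)

    def handle_starttag(self, tag, attrs):
        self.end_block()
        if tag in HEADINGS:
            self.in_heading = HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.in_heading:
            level, text = self.in_heading, self._flush()
            self.in_heading = None
            # Drop headings at the same or a deeper level, then open this one.
            self.path = [h for h in self.path if h[0] < level] + [(level, text)]
            self.sections.append((list(self.path), []))
        elif tag in BLOCKS:
            self.end_block()

    def handle_data(self, data):
        self.buf.append(data)


def md_heading(level, text):
    return "#" * level + " " + text


def chunk_xhtml(xhtml, min_size=200, max_size=1000, context=2):
    parser = SectionParser()
    parser.feed(xhtml)
    parser.close()
    parser.end_block()

    # Break on headings; split oversized sections on block boundaries.
    pieces = []  # each entry: [ancestor_headings, markdown_text]
    for path, blocks in parser.sections:
        own = [md_heading(*path[-1])] if path else []
        ancestors = path[:-1][-context:] if context else []
        parts = list(own)
        for block in blocks:
            size = sum(len(p) for p in parts)
            if len(parts) > len(own) and size + len(block) > max_size:
                pieces.append([ancestors, "\n\n".join(parts)])
                parts = list(own)  # repeat the section heading on each split
            parts.append(block)
        if len(parts) > len(own):
            pieces.append([ancestors, "\n\n".join(parts)])

    # Join a small chunk onto a small neighbour, staying under max_size.
    merged = []
    for ancestors, text in pieces:
        prev = merged[-1] if merged else None
        if (prev and len(prev[1]) < min_size and len(text) < min_size
                and len(prev[1]) + len(text) <= max_size):
            prev[1] += "\n\n" + text
        else:
            merged.append([ancestors, text])

    # Put the 1-2 parent headings back on top and emit markdown.
    return ["\n\n".join([md_heading(*h) for h in ancestors] + [text])
            for ancestors, text in merged]
```

Calling chunk_xhtml(xhtml_string) returns a list of markdown strings, one per chunk. Two known gaps in this sketch: a single paragraph longer than max_size is kept whole rather than split further, and a heading with no text of its own only shows up as context on its children.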
_______________________
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
