Re: Document chunking

Eric Pugh Tue, 09 Apr 2024 04:48:22 -0700

Your approach sounds great as well, Nick.

> On Apr 9, 2024, at 2:21 AM, Michael Wechner <[email protected]> wrote:
> 
> Thanks for sharing your approach!
> 
> Do you already have some code to share?
> 
> Today I read about https://github.com/infiniflow/ragflow which might also 
> have some interesting chunking approaches.
> 
> Thanks
> 
> Michael
> 
> On 09.04.24 at 01:25, Nick Burch wrote:
>> On Mon, 8 Apr 2024, Tim Allison wrote:
>>> Not sure we should jump on the bandwagon, but anything we can do to support 
>>> smart chunking would benefit us.
>>> 
>>> Could just be more integrations with parsers that turn out to be useful. I
>>> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
>>> https://github.com/Filimoa/open-parse
>> 
>> I played around with chunking a bit late last year, but owing to not getting 
>> any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can 
>> say that most people are doing a terrible job in their out-of-the-box 
>> configs...
>> 
>> My current suggested (but not fully tested) approach is:
>>  * Define a range of chunk sizes that you'd like (min / ideal / max)
>>  * Parse as XHTML with Tika
>>  * Keep track of headings and table headers
>>  * Break on headings
>>  * If a chunk is too big, break on other elements (eg div or p)
>>  * If a chunk is too small, and near other small chunks, join them
>>  * Include 1-2 headings above the current one at the top,
>>    as a targeted bit of Table of Contents. (eg chunk starts on H3, put
>>    the H2 in as well)
>>  * If you broke up a huge table, repeat the table headers at the
>>    start of every chunk
>>  * When you're done chunking + adding bits back at the top, convert
>>    to markdown on output
>> 
>> Happy to explain more! But sadly I'm lacking time right now to do much on that.
>> 
>> Nick
>
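
For anyone who wants something concrete to poke at, here is a minimal sketch of
the steps Nick lists above. It is not Nick's code, and it makes several
assumptions: the input is the XHTML string Tika produces, chunk sizes are
counted in characters rather than tokens, only <p> elements are treated as
break points, and table-header repetition is left out. The function name
chunk_xhtml and its parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^h([1-6])$")


def local(tag):
    """Strip the XHTML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1].lower()


def chunk_xhtml(xhtml, min_len=40, max_len=200, context=2):
    """Split XHTML into markdown-style chunks, breaking on headings."""
    root = ET.fromstring(xhtml)
    body = next((e for e in root.iter() if local(e.tag) == "body"), root)

    # 1. Break on headings, tracking the stack of headings in scope.
    sections = []  # (heading path, list of paragraph texts)
    path = []      # [(level, title), ...]
    for el in body.iter():
        name = local(el.tag)
        text = " ".join("".join(el.itertext()).split())
        if not text:
            continue
        m = HEADING.match(name)
        if m:
            level = int(m.group(1))
            path = [h for h in path if h[0] < level] + [(level, text)]
            sections.append((list(path), []))
        elif name == "p":
            if not sections:
                sections.append(([], []))
            sections[-1][1].append(text)

    # 2. If a section is too big, break it on paragraph boundaries.
    pieces = []  # (heading path, body text)
    for hpath, blocks in sections:
        current = ""
        for block in blocks:
            if current and len(current) + len(block) + 2 > max_len:
                pieces.append((hpath, current))
                current = block
            else:
                current = f"{current}\n\n{block}" if current else block
        pieces.append((hpath, current))

    # 3. Join a small piece onto a small neighbour, keeping its own heading.
    merged = []
    for hpath, text in pieces:
        if merged and len(text) < min_len and len(merged[-1][1]) < min_len:
            prev_path, prev_text = merged[-1]
            if hpath and hpath != prev_path:
                level, title = hpath[-1]
                text = f"{'#' * level} {title}\n\n{text}".strip()
            merged[-1] = (prev_path, f"{prev_text}\n\n{text}".strip())
        else:
            merged.append((hpath, text))

    # 4. Prepend the current heading plus up to `context` ancestors, as markdown.
    chunks = []
    for hpath, text in merged:
        heads = [f"{'#' * lvl} {title}" for lvl, title in hpath[-(context + 1):]]
        chunks.append("\n\n".join(heads + ([text] if text else [])))
    return chunks


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<h1>Guide</h1><h2>Install</h2><p>Short step one.</p>"
    "<h3>Linux</h3><p>" + "x" * 150 + "</p><p>" + "y" * 150 + "</p>"
    "</body></html>"
)
chunks = chunk_xhtml(sample)
```

With the sample above, the tiny "Guide" and "Install" sections are joined into
one chunk, while the oversized "Linux" section is split in two, each part
carrying its H1/H2/H3 context at the top. A real version would presumably
count tokens and handle div, lists and tables as well.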


_______________________
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
