Re: downloading t.mbox.gz messages are not sorted in expected order

Jacob Keller Mon, 15 Apr 2024 14:07:11 -0700

On 4/13/2024 12:58 AM, Eric Wong wrote:
> Jacob Keller <[email protected]> wrote:
>>
>>
>> On 4/11/2024 3:42 PM, Konstantin Ryabitsev wrote:
>>> On Thu, Apr 11, 2024 at 03:32:43PM -0700, Jacob Keller wrote:
>>>> I sometimes download patch series off of public inbox hosted servers to
>>>> apply with git-am. Occasionally I have found that these do not apply
>>>> cleanly because the thread is not sorted in patch order.
>>>
>>> It's more than just the order -- if there are replies in the thread, the
>>> mbox file won't apply either.
>>>
>>
>> If the order was correct, it is usually easy enough to just "git am
>> --skip" the patches which have no content. However...
>>
>>> This is the reason why the b4 tool exists:
>>> https://b4.docs.kernel.org/
>>>
>>
>> This is extremely useful and I was unaware of its existence. Thanks!!
> 
> Good to know b4 works for you.
> 
> FWIW, t.mbox.gz uses NNTP article number ordering to ensure
> batched fetches work and duplicates can't get served.
> 
> IOW, it fetches a batch of 1000 header rows at a time from a
> single thread to avoid using too much memory for a single
> request.  The next batch (another 1K) only gets fetched once the
> current batch is done.  So it must order by article number to
> deal with that, especially since new messages may appear in the
> thread while the current batch is being streamed.
> 
> Identical Date: headers can appear multiple times in the same
> thread, so using a >= or > comparison for retrieval wouldn't
> work.
> 
> Of course, most threads are <1000 messages, so I did think
> about sorting by Date for small threads (as we do for the HTML
> output)...
> 
> However with the current t.mbox.gz code, we expect (and can
> handle) new messages appearing while a t.mbox.gz is being
> served.  So if a thread has 10 messages, the first batch fetch
> would only return those 10.  However, while a client is slowly
> downloading the first 10 messages, more messages show up.  The
> current retrieval scheme allows new messages in a thread to show
> up without needing another request.
> 
> AFAIK, it's actually easier and fewer SQL statements to do the
> current way.
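
For anyone following along, here is a rough sketch of the batching scheme
described above. It is not public-inbox's actual code: the `msgs` table, its
columns (`num`, `tid`, `ds`), and the function names are all made up for
illustration, and it assumes article numbers are unique and only ever grow.
It also shows why a plain `>` comparison on a date column would drop messages
that share the same Date: value.

```python
import sqlite3


def make_db():
    # Hypothetical schema: num = article number, tid = thread id, ds = date.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE msgs (num INTEGER PRIMARY KEY, tid INTEGER, ds INTEGER)")
    # Three messages in thread 7, all carrying an identical date value.
    db.executemany("INSERT INTO msgs VALUES (?, ?, ?)",
                   [(1, 7, 100), (2, 7, 100), (3, 7, 100)])
    return db


def stream_by_num(db, tid, batch_size=1000):
    """Yield (num, ds) rows one batch at a time, resuming after the last num."""
    last_num = 0
    while True:
        rows = db.execute(
            "SELECT num, ds FROM msgs WHERE tid = ? AND num > ? "
            "ORDER BY num LIMIT ?",
            (tid, last_num, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        # num is unique, so "> last_num" can neither skip nor repeat a row.
        last_num = rows[-1][0]


def stream_by_date(db, tid, batch_size=1000):
    """Same loop keyed on the date alone; loses rows that tie on ds."""
    last_ds = -1
    while True:
        rows = db.execute(
            "SELECT num, ds FROM msgs WHERE tid = ? AND ds > ? "
            "ORDER BY ds, num LIMIT ?",
            (tid, last_ds, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        # ds is not unique: "> last_ds" skips any remaining rows with this ds.
        last_ds = rows[-1][1]


if __name__ == "__main__":
    db = make_db()
    seen = []
    for num, ds in stream_by_num(db, 7, batch_size=2):
        seen.append(num)
        if num == 1:
            # A new reply arrives while the first batch is still streaming.
            db.execute("INSERT INTO msgs VALUES (4, 7, 101)")
    print(seen)
    print([num for num, ds in stream_by_date(make_db(), 7, batch_size=2)])
```

With a batch size of 2, the article-number loop serves all four messages,
including the one that arrived mid-download, while the date-keyed loop stops
after the first two because the third shares their date. Switching to `>=`
would instead serve the same rows again.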

And given that there are reasons to download a thread other than just patches,
I think it makes sense. I can use b4 and get exactly what I want with
not much more or different effort on my part, and the public-inbox side
doesn't need to bloat to handle that, or make other cases fragile or slower.

Thanks,
Jake