On 2024-02-14 07:07, Thomas Reichel wrote:
Assuming there are `n` available parallel connections: initially, the
largest file is selected and downloaded using one connection while the
other `n-1` connections download the smallest files in parallel.

As files are downloaded, the `n-1` connections continue to download
files from smallest to largest, while the one remaining connection
continues down the file list in the opposite direction, from largest
to smallest. Eventually, the two groups of connections meet somewhere
in the middle of the file list, having completed all of the larger
and smaller files, and that's how we know we're done. Ideally, we run
out of small files at some point before the end, so that when the two
groups of connections "meet in the middle", all connections are
working on moderately sized files, though that's not guaranteed.
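
To make the traversal concrete, here is a minimal standalone sketch
of just the selection order, not the libalpm code itself. The sizes
and names are made up, and the rule that the large connection frees
up after every three small files is a toy stand-in for real
completion events:

/* sketch.c - print the order in which a sorted file list would be
 * started, with one cursor walking down from the largest file and
 * another walking up from the smallest */
#include <stdio.h>

int main(void)
{
    /* hypothetical package sizes in KiB, sorted smallest to largest */
    long sizes[] = { 3, 10, 25, 120, 400, 900, 2048, 8192 };
    int lo = 0;                                             /* forward cursor */
    int hi = (int)(sizeof(sizes) / sizeof(sizes[0])) - 1;  /* reverse cursor */
    int large_busy = 0;  /* is the dedicated connection occupied? */
    int small_started = 0;

    while(lo <= hi) {
        if(!large_busy) {
            /* the dedicated connection takes the largest remaining file */
            printf("large connection starts: %ld KiB\n", sizes[hi]);
            large_busy = 1;
            if(lo == hi) {
                break;  /* the cursors met in the middle; we are done */
            }
            hi--;
        } else {
            /* the other n-1 connections take the smallest remaining files */
            printf("small connection starts: %ld KiB\n", sizes[lo]);
            if(++small_started % 3 == 0) {
                large_busy = 0;  /* pretend the large download completed */
            }
            if(lo == hi) {
                break;  /* the cursors met in the middle; we are done */
            }
            lo++;
        }
    }
    return 0;
}

Running it starts 8192 KiB first, then 3, 10 and 25 KiB, then
2048 KiB, and so on until the two cursors meet in the middle of
the list.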

I hope this is more clear.

Thanks for the additional clarification!  Yes, that's how I understood
it initially, but just wanted to make sure we're on the same page.

The proposed improvements look perfect to me.


On 2/13/24 22:36, Dragan Simic wrote:
Hello Thomas,

On 2024-02-14 04:02, Thomas Reichel wrote:
When downloading small files, the download time seems dominated by
connection latency rather than bandwidth. Downloading several small
files in parallel is an effective way to reduce the impact of that
latency. However, downloading many small files in parallel does not
always saturate the available bandwidth, which is inefficient. This
commit attempts to fully utilize parallelism when downloading small
files while maintaining high bandwidth utilization.
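
As a rough illustration (the numbers are made up): on a 50 MiB/s
link with 100 ms of connection setup latency, a 256 KiB package
needs only about 5 ms of actual transfer time, so latency accounts
for roughly 95% of its wall-clock time, while a 200 MiB package
transfers for about 4 seconds and the same 100 ms is negligible.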

I agree that letting the small files take over the download
concurrency, so to speak, degrades the overall performance of the
parallel downloading. I've observed that many times.

This is accomplished by downloading the smallest files first using
all but one of the parallel connections, while a large file is
downloaded using the remaining connection. The result seems to be a
more stable download speed throughout entire transactions. This is
in contrast to the usual behavior I observed when downloading many
packages, where the download speed progressively declines as the
smaller packages are downloaded.

Just to make sure I understood this correctly: first, the largest
file is selected and downloaded using one connection, while the
remaining connections are used to download small files in parallel?
The process continues until the small files are all downloaded, at
which point the remaining large files are downloaded using all
connections in parallel?

Could you please let me know if I'm correct there?

When my entire cache is deleted and all packages are redownloaded using
`pacman -Qqn | sudo pacman -Sw -`, the mean download speed is 47.8
MiB/s. After this patch, the mean download speed is 54.0 MiB/s. In terms
of time savings, this patch causes a 14.9 GiB download to go from 5
minutes 20 seconds to 4 minutes 43 seconds on my system and network.
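In other words, 14.9 GiB over 320 seconds is about 47.7 MiB/s and
over 283 seconds about 53.9 MiB/s, roughly a 13% improvement.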

Your mileage may vary on different systems, networks, and selections of
packages. I expect there to be virtually no effect on selections of
packages that are all fairly large.

Signed-off-by: Thomas Reichel <tom.p.reic...@gmail.com>
---
 lib/libalpm/dload.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/lib/libalpm/dload.c b/lib/libalpm/dload.c
index 7c344760..4d8777f9 100644
--- a/lib/libalpm/dload.c
+++ b/lib/libalpm/dload.c
@@ -884,7 +884,7 @@ cleanup:
 }

 /*
- * Use to sort payloads by max size in decending order (largest -> smallest)
+ * Use to sort payloads by max size in ascending order (smallest -> largest)
  */
 static int compare_dload_payload_sizes(const void *left_ptr, const void *right_ptr)
 {
@@ -893,7 +893,7 @@ static int compare_dload_payload_sizes(const void *left_ptr, const void *right_p
     left = (struct dload_payload *) left_ptr;
     right = (struct dload_payload *) right_ptr;

-    return right->max_size - left->max_size;
+    return left->max_size - right->max_size;
 }

 /* Returns -1 if an error happened for a required file
@@ -910,18 +910,39 @@ static int curl_download_internal(alpm_handle_t *handle,
     int updated = 0; /* was a file actually updated */
     CURLM *curlm = handle->curlm;
     size_t payloads_size = alpm_list_count(payloads);
+    /* download a single large file alongside the smaller files */
+    struct dload_payload *large_file = NULL;
+    /* a reverse iterator for payloads */
+    alpm_list_t *payloads_reverse;

     /* Sort payloads by package size */
     payloads = alpm_list_msort(payloads, payloads_size, &compare_dload_payload_sizes);

+    payloads_reverse = alpm_list_last(payloads);
+
     while(active_downloads_num > 0 || payloads) {
         CURLMcode mc;

         for(; active_downloads_num < max_streams && payloads; active_downloads_num++) {
-            struct dload_payload *payload = payloads->data;
+            struct dload_payload *payload;
+            if(large_file == NULL) {
+                payload = payloads_reverse->data;
+            } else {
+                payload = payloads->data;
+            }

             if(curl_add_payload(handle, curlm, payload, localpath) == 0) {
-                payloads = payloads->next;
+                if(payloads_reverse == payloads) {
+                    /* the cursors met in the middle; we are finished */
+                    payloads = NULL;
+                    break;
+                }
+                if(large_file == NULL) {
+                    large_file = payloads_reverse->data;
+                    payloads_reverse = payloads_reverse->prev;
+                } else {
+                    payloads = payloads->next;
+                }
             } else {
                 /* The payload failed to start. Do not start any new downloads.
                  * Wait until all active downloads complete.
@@ -959,6 +980,16 @@ static int curl_download_internal(alpm_handle_t *handle,
                     payloads = NULL;
                     err = -1;
                 } else if(ret == 0) {
+                    struct dload_payload *finished_payload;
+                    CURLcode curlerr;
+                    /* which file finished downloading? */
+                    curlerr = curl_easy_getinfo(msg->easy_handle, CURLINFO_PRIVATE, &finished_payload);
+                    ASSERT(curlerr == CURLE_OK, RET_ERR(handle, ALPM_ERR_LIBCURL, -1));
+                    /* did we just finish the large file that was downloading? */
+                    if(large_file != NULL && finished_payload == large_file) {
+                        large_file = NULL;
+                    }
+
                     updated = 1;
                 }
             } else {
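
For context on the CURLINFO_PRIVATE lookup in the last hunk: libcurl
lets a caller attach an arbitrary pointer to an easy handle with
CURLOPT_PRIVATE and read it back later with CURLINFO_PRIVATE, which
is how a finished transfer is mapped back to its payload here. A
minimal sketch of that pairing, outside of libalpm (hypothetical
struct and names; error handling omitted):

#include <stdio.h>
#include <curl/curl.h>

struct payload { const char *name; };

int main(void)
{
    struct payload pkg = { "example.pkg.tar.zst" };
    struct payload *back = NULL;
    CURL *easy = curl_easy_init();

    /* store our pointer on the handle... */
    curl_easy_setopt(easy, CURLOPT_PRIVATE, &pkg);
    /* ...and recover it later, e.g. when the transfer completes */
    curl_easy_getinfo(easy, CURLINFO_PRIVATE, (char **)&back);

    printf("recovered payload: %s\n", back->name);
    curl_easy_cleanup(easy);
    return 0;
}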
