On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
>>> Indeed, creating a dedicated service for this does not seem a good idea.
>> I would love to have this feature integrated directly with
>> distro-tracker. However, I'm wondering about the load that would cause
>> for the service.
> Network requests do not generate much "load"; such processes spend the bulk
> of their time waiting on the network.

True that.

>> The duck worker has to process around 460000 URLs (only counting
>> Homepage) in less than 24h.
> How do you get to that figure? We don't have that many source packages
> and even if you consider multiple URLs for each source package due to
> changes over time (in multiple releases), that makes way too many URLs
> per source package.

Err, sorry about that. That figure is the result of:

$ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
    zgrep -v Homepage: | sort -u | wc -l

Which is obviously wrong, since -v inverts the match and counts every
unique line except the Homepage ones. Here is the command for the real
number:

$ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
    zgrep Homepage: | sort -u | wc -l
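
For anyone who prefers to do the same count from Python, here is a
minimal sketch. It assumes the Sources file has already been downloaded
and decompressed into a string; the function name and the sample data are
made up for illustration. Unlike the grep above, it only looks at lines
whose field name is Homepage, rather than any line containing that text.

```python
def unique_homepages(sources_text):
    """Return the set of distinct Homepage values found in a Sources file."""
    homepages = set()
    for line in sources_text.splitlines():
        # A field line looks like "Name: value". Continuation lines start
        # with whitespace, so their "name" part never equals "homepage".
        name, sep, value = line.partition(":")
        if sep and name.lower() == "homepage":
            homepages.add(value.strip())
    return homepages


# Hypothetical sample: three source stanzas, two of them sharing a homepage.
SAMPLE = """\
Package: foo
Homepage: https://example.org/foo

Package: bar
Homepage: https://example.org/bar

Package: baz
Homepage: https://example.org/foo
"""

count = len(unique_homepages(SAMPLE))
```

Collecting the values in a set plays the role of `sort -u`, so duplicate
homepages are only counted once.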

>> I'm not sure that can be done properly using
>> the distro-tracker tasks (parallel workers are needed to work around
>> timeout). Obviously that can be optimized (different check delay for
>> different results) but that's still bulk network related tasks.
> Nothing forbids parallel workers and in any case, I welcome any
> improvement to the task mechanism to make that kind of parallelism easier
> to handle.
> There are other tasks that could benefit from this (and in general I want
> to merge more of such features in distro-tracker to make them available to
> derivatives too).

Then, let's add this to distro-tracker :)

I've created an issue on the project on salsa so we can discuss
technical details:


As I've said before, I would like to finish up on a couple of other
projects (namely mentors.d.n and snapshot.d.o) and I will be available
right after that.

Baptiste BEAUPLAT - lyknode

