On 30/01/20 10:38 AM, jkn wrote:
On Wednesday, January 29, 2020 at 8:27:03 PM UTC, Chris Angelico wrote:
On Thu, Jan 30, 2020 at 7:06 AM jkn <jkn...@nicorp.f9.co.uk> wrote:
I want to be able to use a simple 'download manager' which I was going
to write (in Python), but then wondered if there was something suitable
already out there. I haven't found anything, but thought people here
might have some ideas for existing work, or approaches.
The situation is this - I have a long list of file URLs and want to
download these as a 'background task'. I want this process to be
'crudely persistent' - you can CTRL-C out, and next time you run things
it will pick up where it left off.
A decent project. I've done this before but in restricted ways.
The download part is not difficult. It is the persistence bit I am
thinking about.
It is not easy to tell the name of the downloaded file from the URL.
I could have a file with all the URLs listed and work through each line in turn.
But then I would have to rewrite the file (say, with the previously-successful
lines commented out) as I go.
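For illustration, that would look something like the sketch below,
where download_url() is a hypothetical helper that fetches a single
URL:

    def process(list_path, download_url):
        with open(list_path) as f:
            lines = f.read().splitlines()
        for i, line in enumerate(lines):
            if not line.strip() or line.startswith("#"):
                continue  # blank, or already downloaded on a prior run
            download_url(line.strip())
            lines[i] = "# " + line  # comment out the successful line...
            # ...and rewrite the whole file, so that a CTRL-C now still
            # leaves an accurate record for the next run.
            with open(list_path, "w") as f:
                f.write("\n".join(lines) + "\n")

The whole file gets rewritten after every URL, which is the clumsy
part.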
...
Thanks for the idea. I should perhaps have said more clearly that it is not
easy (though perhaps not impossible) to infer the name of the downloaded data
from the URL - it is not a 'simple' file URL, more of a tag.
However I guess your scheme would work if I just hashed the URL and created
a marker file - "a39321604c.downloaded" once downloaded. The downloaded content
would be separately (and somewhat opaquely) named, but that doesn't matter.
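Something like this minimal sketch, perhaps (the 10-character hash
length is arbitrary, and download_url() is a hypothetical helper that
does the actual retrieval):

    import hashlib
    import os

    def marker_name(url):
        # Hash the URL, since the URL itself doesn't yield a usable name.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return digest[:10] + ".downloaded"

    def fetch_all(urls, download_url):
        for url in urls:
            marker = marker_name(url)
            if os.path.exists(marker):
                continue  # fetched on a previous run
            download_url(url)
            # Create the marker only after a successful download, so a
            # CTRL-C mid-transfer leaves this URL pending for next time.
            open(marker, "w").close()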
MRAB's scheme does, for me, have the disadvantages that Chris has
pointed out.
Accordingly, +1 to @Dan's suggestion of a database* (a rough sketch
follows, after the footnote):
- it can be structured to act as a queue, for URLs yet to be downloaded
- when downloading starts, the pertinent row can be updated to include
the file name in use (a separate field from the URL)
- when the download is complete, further update the row with a suitable
'flag'
- as long as each write/update is committed, the system will be
interruptible (^c).
Upon resumption, query the DB looking for entries without
completion-flags, and re-start/resume the download process.
If a downloaded file is (later) found to be corrupt, either add the
details to the queue again, or remove the 'flag' from the original entry.
This method could also be extended (at the cost of some complication)
to multiple retrieval threads, if you are smart about it...
* NB I don't use SQLite (in favor of going 'full-fat') and thus cannot
vouch for its behavior under load, as a queuing mechanism, or under
concurrent access... but I'm biased and probably think/write SQL more
readily than Python - oops!
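That said, in rough (untested) Python terms, using the standard-library
sqlite3 module, the flow above might look like this - the table and
column names are illustrative, and name_for() and download_url() stand
in for whatever naming and retrieval code is actually used:

    import sqlite3

    def open_queue(path="downloads.db"):
        db = sqlite3.connect(path)
        db.execute("""CREATE TABLE IF NOT EXISTS queue (
                          url       TEXT PRIMARY KEY,
                          file_name TEXT,
                          done      INTEGER NOT NULL DEFAULT 0)""")
        db.commit()
        return db

    def enqueue(db, urls):
        db.executemany("INSERT OR IGNORE INTO queue (url) VALUES (?)",
                       ((u,) for u in urls))
        db.commit()

    def run(db, name_for, download_url):
        # Upon resumption: any row without the completion flag is pending.
        pending = db.execute(
            "SELECT url FROM queue WHERE done = 0").fetchall()
        for (url,) in pending:
            file_name = name_for(url)   # e.g. the hashed-URL name above
            # Record the file name in use when downloading starts...
            db.execute("UPDATE queue SET file_name = ? WHERE url = ?",
                       (file_name, url))
            db.commit()
            download_url(url, file_name)
            # ...and flag the row once the download has completed.
            db.execute("UPDATE queue SET done = 1 WHERE url = ?", (url,))
            db.commit()

Committing after each update is what makes the ^c-and-resume behavior
work: a kill mid-download leaves the row unflagged, so the next run's
SELECT picks it up again.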
--
Regards =dn