zhuyifei1999 created this task. zhuyifei1999 triaged this task as "Normal" priority. zhuyifei1999 added projects: Google-Code-in-2017, Pywikibot-core, Pywikibot-Other-scripts.
The meaning of atomic here:
(computing) Of an operation: guaranteed to complete either fully or not at all while waiting in a pause, and running synchronously when called by multiple asynchronous threads.
This task shall include two parts:
- Make sure the target file is always a complete file; commit it iff the file is completely downloaded (and verified with a checksum), and discard it if anything goes wrong.
Consider a scenario where, on one machine, a bot that processes the dump starts right after the download script starts, and the dump processor is faster than the downloader. In the current implementation, the processor will eventually find the dump incomplete, and its behavior will often be undefined (e.g. a crash).
One implementation is to perform the download to a temporary file; in the wild, '<filename>.part' is often used. Move the temporary file over the target file iff the download is complete and verified, or delete the temporary file if it fails. On *nix systems, currently running processors will not be disrupted: the file descriptors of their opened dump file continue to point to the old, replaced, now-unlinked file, and they will happily read it without ever knowing a new version has arrived, until the file is reopened for another run.
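As a rough illustration of part 1, here is a minimal sketch of the '.part' approach. The function name `atomic_write`, its parameters, and the choice of SHA-1 are assumptions made for the example; they are not taken from the existing download script. The data arrives as an iterable of byte chunks so that the sketch does not depend on any particular HTTP library.

```python
import hashlib
import os


def atomic_write(target, chunks, expected_sha1=None):
    """Write chunks to '<target>.part', then move it over target.

    Hypothetical helper: the target is only replaced once every chunk
    has been written and the optional SHA-1 checksum matches.
    """
    part = target + '.part'
    digest = hashlib.sha1()
    try:
        with open(part, 'wb') as f:
            for chunk in chunks:
                f.write(chunk)
                digest.update(chunk)
            # Push the data to disk before the rename makes it visible.
            f.flush()
            os.fsync(f.fileno())
        if expected_sha1 is not None and digest.hexdigest() != expected_sha1:
            raise ValueError('checksum mismatch for ' + target)
        # Commit: atomically replaces target on POSIX file systems.
        os.replace(part, target)
    except BaseException:
        # Discard: never leave a half-written temporary file behind.
        try:
            os.remove(part)
        except FileNotFoundError:
            pass
        raise
```

A caller would pass the response body in chunks, for example `atomic_write('dump.xml.bz2', response_chunks, sha1_from_server)`, where both arguments are placeholders. Note that `os.replace` can fail on Windows when another process still has the target open, so the "readers are not disrupted" property described above should only be assumed on *nix.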
- Make sure two downloaders do not write on the same partially-downloaded file at the same time.
Consider a scenario where two downloaders run at the same time. If their file descriptors point to the same inode, their write(2) calls will go to the same file, corrupting it (and possibly inflating its size, e.g. when both append). Otherwise, in the case where they point to different inodes but the same filename, the suggested implementation in part 1 will fail, as moving a file with rename(2) operates on the filename, not the inode. The early-finishing downloader will consider the dump downloaded by the late-starting downloader complete and commit it; since the late starter is still mid-download, the target file ends up in a partially complete state, and dump consumers hit undefined behavior.
One implementation is to lock the file being written to, and exit with an error if that file is already locked. Another is to avoid shared filenames entirely.
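The locking option for part 2 could look roughly like the sketch below. It assumes a *nix system (the `fcntl` module is not available on Windows) and uses a separate, hypothetical '<filename>.lock' file rather than locking the '.part' file itself, so that the lock is never affected by the rename in part 1. The name `download_lock` is made up for the example.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def download_lock(target):
    """Hold an exclusive advisory lock on '<target>.lock'.

    Hypothetical helper: raises RuntimeError at once if another
    downloader already holds the lock, instead of waiting for it.
    """
    lock_path = target + '.lock'
    # Open without truncating; the lock file is never renamed or deleted.
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            raise RuntimeError(
                'another downloader is writing ' + target) from None
        yield
    finally:
        # Closing the descriptor releases the lock, also after a crash.
        os.close(fd)
```

The download would then run inside `with download_lock(filename): ...`, and the script would exit with an error when `RuntimeError` is raised. The lock is advisory, so it only protects against other downloaders that take the same lock. The alternative of avoiding shared filenames could instead be approached with something like `tempfile.mkstemp` in the target directory.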
