Follow-up Comment #2, bug #45803 (project wget):

# Parallel
If wget only fetches things serially... I deferred any parallelism to the sole filter program, in case it wanted to spread out its decision process and recombine it into a single logical answer. If wget does fetching in parallel... yes, it could spawn checks in parallel, but it would have to be the same program, not prog1 prog2 prog3, else there could be three different results.

# CSV

" WGET_FILTER_URI. This variable shall contain exactly what is passed to the current commandline regexes today, ie: https://www.example.com/foo/bar?a=b&c=d#123 "

I wanted wget to have a base mode where each of the three methods would be fed the exact same string by default, so that the user can test and swap regexes between them equally. Of course wget could feed other things (such as CSV) via enhanced modes to the filter program, which could in turn do anything it wants.

# Referer, etc

" the following optional set of variables should also be passed to the program if readily implementable today (each of them can result in different serving hierarchy contexts "

What would WGET_FILTER_REFERER be used for? Yes, referers, time of day, agent, dynamic pages, etc. can all serve up different content... those are typically *logical* differences within the same service instance. The variables I put in this section are the typical set used in server-side configs to present entirely different *physical* hierarchies of data or server [virtual] instances, and they apply to both HTTP and FTP. (Wget is currently dumb about that regarding its on-disk storage... it doesn't encode such info into the basedir pathname and thus will clobber itself by physically merging multiple contexts on disk during recursive spidering. That's a wget design failure to fix.)

Thus, if logical things like the referer are felt to be needed, I'd rather see the entire set of client request headers to the server be stuffed into this CSV you speak of, as WGET_FILTER_CLIREQ_CSV.
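As a rough illustration of the base mode, here is a minimal sketch of what a filter program could look like when it is handed only WGET_FILTER_URI in its environment. The exit statuses 0 (accept) and 1 (reject) and both regexes are assumptions made up for this example; only status 2, the reserved fall-through status, comes from the spec. None of this is existing wget behaviour.

```python
import os
import re

# Exit statuses. 0 and 1 are assumptions for this sketch; 2 is the
# reserved fall-through status from the proposed spec.
ACCEPT = 0
REJECT = 1
FALLTHROUGH = 2

# Hypothetical rules, for illustration only.
ACCEPT_RE = re.compile(r"^https://www\.example\.com/foo/")
REJECT_RE = re.compile(r"\.(?:iso|zip)(?:[?#]|$)")


def decide(environ):
    """Return an exit status for the URI found in WGET_FILTER_URI."""
    uri = environ.get("WGET_FILTER_URI")
    if uri is None:
        return FALLTHROUGH  # nothing to check; leave it to wget's default
    if REJECT_RE.search(uri):
        return REJECT
    if ACCEPT_RE.search(uri):
        return ACCEPT
    return FALLTHROUGH      # no explicit rule matched


def main():
    # The program sees the same string the commandline regexes would see.
    return decide(os.environ)

# Saved as a standalone script, the file would end with:
#     raise SystemExit(main())
```

Because the program receives the same string as the commandline regexes, a regex that works in one place can be moved to the other without change.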
I also wanted the input passing mechanism to be via environment variables, since novice scripters and coders can use those but may not yet know how to process standard input (or the filesystem), which would prevent them from using --uri-filter-prog.

I'm not keen on passing more things via the filesystem unless wget's other metafile handling (such as cookies, logs, and even a future "resume full prior state of crawl") is also cleaned up in the process. By this I mean that there should probably be some control flags such as --statefile-basedir and --statefile-basedir-auto that will put all these statefiles under one dir (optionally auto mkstemp), and under default filenames. The filesystem is also slower, but could be useful in other ways.

It would be possible to support multiple input methods with:

--uri-filter-prog-type=env-basic:env-phys:stdin-req:fs-csv

env-basic: WGET_FILTER_URI
env-phys: my full set of vars
stdin-req: the entire client request via stdin
fs-csv: client request via filesystem
...: and other permutations

The idea was to keep it simple enough to get the three feature enhancements out to people quickly. For the first, I put setenv() and system() at utils.c:949
http://git.savannah.gnu.org/cgit/wget.git/tree/src/utils.c
rev: 52228516b5d00c1dcf3623c4e3250490d1eb1d60

I added exit status 2 to the spec as reserved. The program may use it as a catchall exit status for URIs that fall through its explicit accept / reject checks; wget then applies whatever the default sense is, as set with --uri-filter-prog-default=accept:reject, default reject.

WGET_FILTER_HOST should be as in the original, with no DNS conversion.

Feel free to run with it as desired; it should be readily expandable to anyone's needs.

_______________________________________________________

Reply to this item at:

<http://savannah.gnu.org/bugs/?45803>

_______________________________________________
Message sent via/by Savannah
http://savannah.gnu.org/
