-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matthew Woehlke wrote:
> Micah Cowan wrote:
>> [stuff about regex matching]
>
> How will you handle nested boolean expressions? Same as 'find'?
>
> IOW, how do you do this?
> [url matches foo] AND ( [domain matches bar] OR [query matches baz] )
>
> (Obviously I am intentionally choosing an example where the 'or' part
> can't be easily expressed in the regex.)

Actually, my plan is... not to. The current method for checking
accept/reject rules - if it's in the acceptance list, and not in the
reject list, it's in - has never garnered any complaints to my
knowledge. And technically, such logic is representable in the regex
itself; it would just be butt-ugly, and a pain to craft (but only if
you wanted the literal case you mention: I can't think of a
circumstance where it'd actually be useful).

I had thought that, if we really want a robust system, we could do it
as a series of result tables, à la PAM or iptables. But that tends to
be vastly more confusing to users. My expectation is that the real use
cases for such a thing are going to be incredibly rare.

At some point (1.13, I'm hoping?) Wget will provide more options for
delegating tasks to outside programs, which is a good deal more Unixy.
In that event, we could pawn off acceptability decisions to an awk
script or what have you.

My primary concern for the immediate future, really, is to start
supporting query-string matching, and to fix some of the things about
accept/reject that I consider fundamentally broken. Regexes were
something that had always been attractive, and they seem to be a
convenient way to address the query string problem at the same time.

OTOH, I realize that we want to make it as robust as possible. If
someone can come up with a simple and easy-to-use system, I'm
interested.

>> --no-match ':field:action=(edit|print)'
>
> Something like 'param[eter]' or 'arg[ument]' seems more sensible to me
> (though as a programmer I am not the best to ask about usability
> things). Such URL's coming from a form isn't always obvious... and in
> some cases is even untrue.

"Parameter", at least, suffers from the shortcoming that it forces both
itself and "path" to specify a minimum of three characters to get a
unique label.

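To make the label-length point concrete, here is a minimal Python
sketch. It assumes the component labels are url, scheme, domain, path,
and query plus one name for the per-field component; that label set is
pieced together from this thread, and shortest_unique_prefixes is a
hypothetical helper, not anything in Wget.

```python
def shortest_unique_prefixes(labels):
    """Map each label to its shortest prefix that no other label starts with."""
    result = {}
    for label in labels:
        others = [other for other in labels if other != label]
        for length in range(1, len(label) + 1):
            prefix = label[:length]
            if not any(other.startswith(prefix) for other in others):
                result[label] = prefix
                break
        else:
            # The whole label is a prefix of another label; keep it in full.
            result[label] = label
    return result


# Assumed label sets: identical except for the name of the last component.
BASE = ["url", "scheme", "domain", "path", "query"]
with_field = shortest_unique_prefixes(BASE + ["field"])
with_parameter = shortest_unique_prefixes(BASE + ["parameter"])

print(with_field["path"], with_field["field"])
print(with_parameter["path"], with_parameter["parameter"])
```

With "field", every label can be cut down to a single letter; with
"parameter", both it and "path" need three.
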
I'd go so far as to say it's frequently the case that such query-string
formats don't come from an HTML form; however, as far as I know, HTML
forms are the only thing that actually specifies that format, and I
assume they're directly responsible for the popularity of this
representation. CGI itself doesn't lay any constraints or even
expectations on what the query string should look like (though most
libraries implementing CGI provide facilities for the HTML forms
format).

However, it could still be termed a "field", I believe, regardless of
whether it comes from an HTML form or not. I personally prefer it over
either "parameter" or "argument", but I'm willing to hear more
opinions. (I might prefer "parameter", actually, if it weren't for the
conflict with "path". But I doubt we'll identify a more appropriate
name for "path".)

>> . Don't follow links for producing printer-ready output, or editing
>> pages. Equivalent to --no-match ':query:(.*&)?action=print(&.*)?',
>> but somewhat easier to write.
>
> Just in case you're planning on a conversion to that regex in the code,
> remember that it is really:
> '^.*[?]([^&]*&)*action=print(&.*)?$'

No, it's not. The anchors are already implicit, remember? And the
":query:" label means it only tries that regex against the query-string
portion of the URL, so the .*[?] would break. If I'd used :url: instead
of :query:, then that modification would be necessary (though still
without the anchors).

> For that matter, if you support '\b', I wonder if you need "components"
> at all...

I don't see how that would help anywhere save for the "fields"
component; but then I gather from the above that you may have been a
bit confused about what the component selection does.

If I were going to support that (and I may), then I'd probably go with
\< and \> instead, as that's what seems to be commonly used for EREs,
outside of Perl.

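The point above about implicit anchors and the ":query:" label can be
sketched in Python. This is only an illustration under assumptions:
Python's re module stands in for whatever regex engine Wget ends up
using, re.fullmatch plays the role of the implicit anchors, and
component_matches and the example.com URL are made up for the example.

```python
import re
from urllib.parse import urlsplit


def component_matches(pattern, url, component):
    """Match pattern against one component of url, anchored at both ends."""
    if component == "query":
        subject = urlsplit(url).query   # text after the "?", mark excluded
    elif component == "url":
        subject = url
    else:
        raise ValueError("unsupported component: " + component)
    # fullmatch requires the whole subject to match, like implicit ^...$
    return re.fullmatch(pattern, subject) is not None


QUERY_PATTERN = r"(.*&)?action=print(&.*)?"
URL_PATTERN = r".*[?]([^&]*&)*action=print(&.*)?"

url = "http://example.com/wiki/index.php?title=Foo&action=print"

print(component_matches(QUERY_PATTERN, url, "query"))
print(component_matches(URL_PATTERN, url, "query"))
print(component_matches(URL_PATTERN, url, "url"))
```

The first and third calls match; the second does not, because the query
component never contains the "?" that the pattern insists on.
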
>> Components may be combined; to match against the combination of path and
>> query string, you just specify :path+query:. That could be abbreviated
>> as :p+q:. Combinations are only allowed if all the components involved
>> are consecutive; :domain+query: (no path) would be illegal.
>
> I can probably figure out technical reasons for that, but it doesn't
> make much sense from a user perspective. Why shouldn't I be able to write:
> -z ':d,f:foo'
> ...and have it match both
> 'http://foobar.com/'
> and
> 'http://baz.org/index?title=foobar'
> ?

No. It means that the entire regex is matched, once, against the
combined components, not matched once for each of the components. There
is no sane way to combine only the domain and a field (field would not
be allowed to combine with anything, in fact).

> BTW, what exactly are the components? Is this right?
>
> [u]rl: http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64
> p[r]otocol: "http"
> [d]omain: "foobar.com"
> [p]ath: "site/images"
> [f]ile: "thumb.php"
> [q]uery: "name=baz.jpg&x=64&y=64"
> [a]rgs: "name=baz.jpg", "x=64", "y=64"

This is the diagram I did (but didn't include in the message I sent
out).

/---\ scheme             /------ path ---\ /---------- query ------\
https://addictivecode.org/foo/bar/baz.html?fee=fi&fo=fum&bludtype=en
        \---- domain ---/                  \----/ field

The idea would be that "query" would match everything after the
question mark (so don't include the mark in your regex), though "path"
would include the initial / (it would always have one). The main
advantage to that for "path" would be that you can do ".*/index.html";
I can't think of any similar advantage to including the ? in the query
string.

However, if you specify :path+query:, then the question mark is
included. Similarly, :scheme: wouldn't include the "://", but
:scheme+domain: would.

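The component rules described above can be sketched roughly as follows
in Python. This is a sketch under stated assumptions, not Wget code:
select_component, fields, ORDER, and JOINER are hypothetical names,
urlsplit's netloc is taken as the "domain", and an empty path is
treated as "/".

```python
from urllib.parse import urlsplit

# Components in the order they appear in a URL (names taken from the diagram).
ORDER = ["scheme", "domain", "path", "query"]
# Text that sits between a component and the one before it.
JOINER = {"domain": "://", "path": "", "query": "?"}


def select_component(url, spec):
    """Return the text a regex would be matched against for a spec such as
    "path" or "path+query" (hypothetical helper, not Wget code)."""
    parts = urlsplit(url)
    values = {
        "scheme": parts.scheme,
        "domain": parts.netloc,      # assumption: port/userinfo left in place
        "path": parts.path or "/",   # assumption: an empty path becomes "/"
        "query": parts.query,        # text after the "?", mark excluded
    }
    names = spec.split("+")
    positions = [ORDER.index(name) for name in names]  # ValueError if unknown
    if positions != list(range(positions[0], positions[0] + len(names))):
        raise ValueError("components must be consecutive: " + spec)
    text = values[names[0]]
    for name in names[1:]:
        text += JOINER[name] + values[name]
    return text


def fields(url):
    """Split the query string into individual fields; each is matched alone."""
    query = urlsplit(url).query
    return query.split("&") if query else []


URL = "https://addictivecode.org/foo/bar/baz.html?fee=fi&fo=fum&bludtype=en"
print(select_component(URL, "path"))
print(select_component(URL, "path+query"))
print(select_component(URL, "scheme+domain"))
print(fields(URL))
```

A spec such as "domain+query" raises ValueError here, since the two
components are not adjacent; "field" is kept out of ORDER on purpose, so
it cannot be combined with anything.
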
>> - Avoid adding both a --match and a --no-match option, by making
>> negation a flag instead (/n or something: --match 'p/ni:.*\.js'
>> would reject any paths ending in any case variant of ".js").
>
> Similar ideas:
> -z '(?!expr)'

This one's of course automatic with PCRE if we provide that with an
option; we'd have to "emulate" it in builds not including PCRE.

>> - Other anchoring options. I suspect that the many common use cases
>> will begin with '.*'. We could remove the implicit anchoring, but
>> then we'd probably usually want it at the end, forcing us to write
>> the final '$'. That's one character versus two, but my gut tells me
>> it's easier to forget anchors than it is to forget "match-any"
>> patterns, which is why I lean toward implicit anchors.
>
> MHO: implicit anchoring violates traditional regex usage. There is
> probably an example of implicit anchoring somewhere, but offhand I can't
> think of it. (And at any rate, sed/grep sure don't use implicit anchoring.)

Both sed and grep use regex as the basis of a _search_. We're not
_searching_ for a pattern in a string, we're matching. (Find's manual
uses this same reasoning). Additionally, implicit anchoring is
obviously unhelpful to sed and grep, because by far the most common use
cases want to match anywhere.

On the other hand, this reasoning doesn't apply so cleanly to Perl and
(especially) Awk, where you can argue that the regex's primary function
is for matching, not searching (in Awk, you have to call a separate
function afterwards to get the search results; the regex just returns a
boolean value).

Still, it seems to me we're most closely trying to do what find does,
and not what sed, grep, perl or awk do. We're not the least bit
interested in transforming the URL afterwards, or remembering match
start/end positions*.

* Actually, it'd be great to do transformations on URLs, and especially
file names.
We'll do that eventually, but not via built-in Wget facilities; we'll
outsource it to sed or other user-specified commands.

> Of course, if you support '\b' (and require explicit anchoring), then it
> is somewhat hard to justify args (as you can just use '\bexpr\b' against
> query, instead of '^expr$' against args).

Not really. \b would falsely match punctuation that can be a legitimate
part of a field name and/or value (and, in particular, %XX).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkq6hgIACgkQ7M8hyUobTrHvRwCfd7rndVUln9ZMmKJs3Twvx7rf
l3gAn2x+t1JTHuKT9xY1YtmtLxLyqXIM
=ebCu
-----END PGP SIGNATURE-----
