Re: [PLUG] question on linux tool to clean URLs

Ben Koenig Wed, 06 Feb 2019 17:32:45 -0800

I don't mean to argue against sed/awk here. The thing to remember is that
while query parameters normally follow a '?', a site is free to carry
parameters elsewhere, for example in the path itself. Putting them after
the '?' is a convention that most webservers follow, but nothing forces a
site to keep its tracking data there.
So the problem with a basic script that clips everything from the '?' on is
that you risk being left with extra crap you don't want whenever a
webserver does something custom.
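
A quick illustration of that risk, as a minimal sketch in Python. The URL
and its ";jsessionid" path parameter are made up for the example, and
urlparse() only recognizes this one style of non-query parameter, so treat
it as a demonstration rather than a complete cleaner:

```python
from urllib.parse import urlparse

# Hypothetical URL: a path parameter (";jsessionid=...") sits before the "?".
url = "https://example.com/page;jsessionid=ABC123?utm_source=newsletter"

# Naive approach: clip everything from the first "?" onward.
naive = url.split("?", 1)[0]

# urlparse() separates path parameters and the query string from the path,
# so each piece can be dropped explicitly.
parts = urlparse(url)
cleaned = parts._replace(params="", query="", fragment="").geturl()

print(naive)    # https://example.com/page;jsessionid=ABC123
print(cleaned)  # https://example.com/page
```

The naive split keeps the session identifier; the parsed version drops it.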


In a way, the OP's question was somewhat vague. If he can guarantee that
every URL keeps its parameters in a standard query string (RFC 3986), then
yes, by all means use a one-line sed script. If he wants to clean a given
URL from anywhere on the internet, then he needs a more comprehensive URL
handling tool. To give an exact answer we really do need more information
about what he is trying to accomplish.
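
For the more comprehensive route, something along these lines might be a
starting point. It is only a sketch: the blocklist (fbclid, gclid, and
anything starting with "utm_") is an assumption for illustration, the
function name strip_tracking is hypothetical, and the example URL is
invented. It keeps the parameters a page actually needs and removes only
the ones on the list:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed blocklist for illustration; a real one would need to be longer.
TRACKING_KEYS = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)

def strip_tracking(url):
    """Return url with known tracking parameters removed from its query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # urlencode() re-encodes the surviving pairs, so escaping may be
    # normalized compared with the input.
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_tracking("https://example.com/article?id=42&fbclid=XYZ&utm_source=mail"))
# https://example.com/article?id=42
```

Unlike clipping at the '?', this leaves "id=42" alone, which matters for
sites where the query string selects the content.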

Again, I'm not trying to devalue the power of sed or awk here. I know how
easy it would be to split on the ? and then just print the first part.
But URLs are much more complex than that, and without comprehensive URL
handling you open yourself up to malformed URL attacks. Trying to solve a
complex puzzle with one-liners is literally how exploits start....

Just trying to make the Net a safer place by re-using open-source code
written by software devs infinitely more experienced than myself.



On Wed, Feb 6, 2019 at 5:02 PM John Sechrest <[email protected]> wrote:

> I am sure this is buildable with a one line perl script. Probably with SED
> as well. Depends on the level of cleaning you want.
>
> Likely, you get 90% of the way just by cutting off everything after the ?
> in the URL ... including the ?
>
> On Wed, Feb 6, 2019, 4:52 PM Ben Koenig <[email protected]> wrote:
>
> > I don't know of a tool that does this, but URL formatting is common for a
> > lot of programming tasks. If you know python, setting up a small script
> > that returns specific pieces of a URL is trivial.
> >
> > https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse
> >
> > Qt5 (and probably GTK too) has similar URL parsing mechanisms, and you
> > could probably find similar functionality in most high-level scripting
> > languages through the appropriate module or library. Now whether or not a
> > tool already exists that does this in a production friendly way...
> > probably not, just example apps and code.  The 'QUrl' object within Qt5
> > does a nice job of abstracting the components of a network location in
> > C++ so there might be someone who threw up a quick little demo app on
> > github.
> >
> >
> >
> > On Tue, Feb 5, 2019 at 8:50 PM David Barr <[email protected]> wrote:
> >
> > > Hey, Randall,
> > >
> > > To be pedantic, the tracking tags and such are all stuff that appears
> > > after the question mark delimiting character in the URL of the HTTP
> > > GET request, right? `https://foo/bar/baz?evil_tag=evil`
> > >
> > > The trick then, is to select only the lines containing question marks,
> > > and then delete from the question mark to the end of the line. Try
> > > this:
> > >
> > > ```
> > > sed -e '/?/ s/?.*$//' <file>
> > > ```
> > >
> > > Pedantry again: That's "select lines containing a question mark,"
> > > followed by "substitute all characters from and including that
> > > question mark to the end of the line ($) with nothing." The question
> > > mark is left unescaped because it is an ordinary character in sed's
> > > basic regular expressions; GNU sed treats `\?` as a quantifier.
> > >
> > > I haven't tested this on a file, so I deserve whatever mockery I get if
> > > I missed something.
> > >
> > > Cheers!
> > > David
> > >
> > > On 2/5/19 2:48 PM, logical american wrote:
> > > > Hi:
> > > >
> > > > Is there a linux tool which cleans up the URLs in a text file (I
> > > > believe Western unicode encoding) so that all the tracking tags,
> > > > fbclid, etc are removed and the pure URL is left in the text?
> > > >
> > > > In one recent email I received, there were 28 govdelivery.com tags
> > > > and others embedded inside the URLs, and I don't wish the posted
> > > > material to provide an easy access for the website to be tracked.
> > > >
> > > > Thanks
> > > >
> > > > Randall
> > > >
> > >
> > > _______________________________________________
> > > PLUG mailing list
> > > [email protected]
> > > http://lists.pdxlinux.org/mailman/listinfo/plug
