I am sure this is doable with a one-line Perl script. Probably with sed
as well. It depends on the level of cleaning you want.

Likely, you get 90% of the way just by cutting off everything after the ?
in the URL, including the ? itself.
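
For the remaining 10%, where the query string carries something the page
actually needs, here is a rough Python sketch using the urllib.parse module
linked further down the thread. It is untested against real mail; the
function names, the URL-matching regex, and the list of tracking parameters
(fbclid, gclid, utm_*) are my own assumptions, not anything standard.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracker names; extend to match what shows up in your mail.
TRACKING_NAMES = {"fbclid", "gclid"}
TRACKING_PREFIXES = ("utm_",)

# Deliberately loose URL matcher (an assumption, not a full URL grammar).
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def strip_query(url):
    """Blunt approach: drop the first '?' and everything after it."""
    return url.split("?", 1)[0]


def strip_tracking(url):
    """Gentler approach: keep the query but drop parameters that look like trackers."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_NAMES and not name.startswith(TRACKING_PREFIXES)
    ]
    # urlencode re-encodes the surviving values, so their escaping may change.
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_text(text, cleaner=strip_query):
    """Apply a cleaner function to every http(s) URL found in text."""
    return URL_RE.sub(lambda match: cleaner(match.group(0)), text)
```

Calling clean_text(text) does the blunt cut at the ?, and
clean_text(text, strip_tracking) keeps whatever parameters are not on the
list. One caveat: the regex also swallows punctuation that directly follows
a URL, so a sentence-ending period can get cut along with the query.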

On Wed, Feb 6, 2019, 4:52 PM Ben Koenig <[email protected]> wrote:

> I don't know of a tool that does this, but URL formatting is common for a
> lot of programming tasks. If you know python, setting up a small script
> that returns specific pieces of a URL is trivial.
>
> https://docs.python.org/3/library/urllib.parse.html#module-urllib.parse
>
> Qt5 (and probably GTK too) has similar URL parsing mechanisms, and you
> could probably find similar functionality in most high-level scripting
> languages through the appropriate module or library. Now whether or not a
> tool already exists that does this in a production-friendly way... probably
> not, just example apps and code. The 'QUrl' object within Qt5 does a nice
> job of abstracting the components of a network location in C++, so there
> might be someone who threw up a quick little demo app on GitHub.
>
>
>
> On Tue, Feb 5, 2019 at 8:50 PM David Barr <[email protected]> wrote:
>
> > Hey, Randall,
> >
> > To be pedantic, the tracking tags and such are all stuff that appears
> > after the question mark delimiting character in the URL (the query
> > string), right? `https://foo/bar/baz?evil_tag=evil`
> >
> > The trick then, is to select only the lines containing question marks,
> > and then delete from the question mark to the end of the line. Try this:
> >
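> > ```
> > sed -e 's/?.*$//' <file>
> > ```
> >
> > Pedantry again: That's "substitute all characters from and including
> > the first question mark to the end of the line ($) with nothing." In
> > sed's default basic regular expressions a bare ? is a literal question
> > mark; backslash-escaping it (\?) turns it into GNU sed's "zero or one"
> > operator, so leave it unescaped. Lines without a question mark don't
> > match and pass through unchanged.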
> >
> > I haven't tested this on a file, so I deserve whatever mockery I get if
> > I missed something.
> >
> > Cheers!
> > David
> >
> > On 2/5/19 2:48 PM, logical american wrote:
> > > Hi:
> > >
> > > Is there a linux tool which cleans up the URLs in a text file (I
> > > believe Western unicode encoding) so that all the tracking tags,
> > > fbclid, etc are removed and the pure URL is left in the text?
> > >
> > > In one recent email I received, there were 28 govdelivery.com tags and
> > > others embedded inside the URLs, and I don't wish the posted material
> > > to provide an easy access for the website to be tracked.
> > >
> > > Thanks
> > >
> > > Randall
> > >
> >
> > _______________________________________________
> > PLUG mailing list
> > [email protected]
> > http://lists.pdxlinux.org/mailman/listinfo/plug
