Re: well, blew it... sed or perl q again.
On Tue, 30 Dec 2008 11:31:14 -0800, Gary Kline kl...@thought.org said:

G> The problem is that there are many, _many_ embedded <A
G> HREF="http://whatever"> Site</A> in my hundreds, or thousands, of
G> files.  I only want to delete the http://junkfoo.com lines, _not_
G> the other Href links.

Use perl.  You'll want the "i" option to do case-insensitive matching,
plus "m" for matching that could span multiple lines; the first quoted
line above shows one of several places where a URL can cross a
line-break.

You might want to leave the originals completely alone.  I never trust
programs to modify files in place:

   you% mkdir /tmp/work
   you% find . -type f -print | xargs grep -li http://junkfoo.com > FILES
   you% pax -rwdv -pe /tmp/work < FILES

Your perl script can just read FILES and overwrite the stuff in the new
directory.  You'll want to slurp the entire file into memory so you
catch any URL that spans multiple lines.

Try the script below; it works for input like this:

   This <a HREF="http://junkfoo.com"> Site</a> should go away too.
   And so should <a HREF=
   "http://junkfoo.com/"> Site</a> this
   And finally <a HREF="http://junkfoo.com/">Site</a> this

-- 
Karl Vogel                      I don't speak for the USAF or my company
The average person falls asleep in seven minutes.
                                       --item for a lull in conversation

---
#!/usr/bin/perl -w
use strict;

my $URL = 'href=(.*?)http://junkfoo.com/*"';
my $contents;
my $fh;
my $infile;
my $outfile;

# Read the list of filenames (FILES) on stdin or as arguments.
while (<>) {
    chomp;
    $infile = $_;
    s{^\./}{/tmp/work/};        # write the edited copy under /tmp/work
    $outfile = $_;

    open ($fh, '<', $infile) or die $infile;
    $contents = do { local $/; <$fh> };     # slurp the whole file
    close ($fh);

    $contents =~ s{             # substitute ...
        <a(.*?)                 # ... URL start
        $URL                    # ... actual link
        >(.*?)                  # ... min # of chars including newline
        </a>                    # ... until we end
    }
    { }gixms;                   # ... with a single space

    open ($fh, '>', $outfile) or die $outfile;
    print $fh $contents;
    close ($fh);
}
exit(0);
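Before pointing the script at real files, it may help to check the
multi-line substitution in isolation.  The following is a minimal,
self-contained sketch; the sample HTML and the junkfoo.com pattern are
placeholders for illustration, not part of Karl's message:

#!/usr/bin/perl -w
# Standalone check of the slurp-and-substitute regex used above.
use strict;

my $URL = 'href=(.*?)http://junkfoo.com/*"';

my $contents = <<'HTML';
This <a HREF="http://junkfoo.com"> Site</a> should go away.
And so should <a HREF=
"http://junkfoo.com/"> Site</a> this.
But keep <a HREF="http://example.org"> Other</a> links.
HTML

# Same substitution as in the script: /s lets . cross newlines,
# /i ignores case, /x permits the spaced-out pattern layout.
$contents =~ s{ <a(.*?) $URL >(.*?) </a> }{ }gixms;
print $contents;

Running it should collapse both junkfoo.com anchors, including the one
split across a line-break, to a single space, while leaving the
example.org link intact.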
Re: well, blew it... sed or perl q again.
On Wed, Dec 31, 2008 at 03:20:14PM -0500, Karl Vogel wrote:
> You might want to leave the originals completely alone.  I never
> trust programs to modify files in place:
>
>    you% mkdir /tmp/work
>    you% find . -type f -print | xargs grep -li http://junkfoo.com > FILES
>    you% pax -rwdv -pe /tmp/work < FILES
          ^^^

	pax is like cpio, isn't it?  anyway, yes, i'll ponder this.
	i [mis]-spent hours undoing something bizarre that my scrub.c
	binary did to directories, turning foo and bar (and scores
	more) into foo and foo.bar, bar and bar.bak.  the bak were the
	saved directories.  the foo, bar were bizarre.  i couldn't
	write/cp/mv over them.  had to carefully rm -f foo; mv foo.bar
	foo [et cetera]...  then i scp'd my files to two other
	computers.  (*mumble)

> Your perl script can just read FILES and overwrite the stuff in the
> new directory.  You'll want to slurp the entire file into memory so
> you catch any URL that spans multiple lines.
>
> [script snipped]

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
        http://jottings.thought.org   http://transfinite.thought.org
    The 2.17a release of Jottings: http://jottings.thought.org/index.php
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
> The problem is that there are many, _many_ embedded <A
> HREF="http://whatever"> Site</A> in my hundreds, or thousands, of
> files.  I only want to delete the http://junkfoo.com lines, _not_
> the other Href links.
>
> Which would be best to use, given that a backup is critical?  sed or
> perl?

IMHO, perl with the -i option to do in-place editing with backups.  You
could also use the -p option to loop over files.  See perlrun(1).

Roland
-- 
R.F.Smith                                   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914 B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)
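To make Roland's suggestion concrete, here is a hedged sketch (the
junkfoo.com pattern and the *.html glob are assumptions, not from his
mail):

   perl -pi.bak -e 's{<a[^>]*href="http://junkfoo\.com[^"]*"[^>]*>.*?</a>}{}gi' *.html

With -p, perl reads each line, applies the substitution, and prints the
result back; -i.bak rewrites the file in place and keeps the original
as *.html.bak.  Because -p works line by line, this only catches
anchors that sit entirely on one line; anchors wrapped across a
line-break need the whole-file slurp shown in Karl's script above.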
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 09:16:23PM +0100, Roland Smith wrote:
> On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
> > The problem is that there are many, _many_ embedded <A
> > HREF="http://whatever"> Site</A> in my hundreds, or thousands, of
> > files.  I only want to delete the http://junkfoo.com lines, _not_
> > the other Href links.
> >
> > Which would be best to use, given that a backup is critical?  sed
> > or perl?
>
> IMHO, perl with the -i option to do in-place editing with backups.
> You could also use the -p option to loop over files.  See perlrun(1).

	All right, then: is this the right syntax?  In other words, do
	I need the double quotes to match the "http:" string?

	perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

	gary
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 12:51:31PM -0800, Gary Kline wrote:
> 	All right, then: is this the right syntax?  In other words, do
> 	I need the double quotes to match the "http:" string?
>
> 	perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

In years past I used fetch(1) to download the day's page from a comic
strip site, awk to extract the URL of the day's comic strip, and fetch
again to put a copy of the comic strip in my archive.  This application
sounds similar.

-- 
David Kelly N4HHE, dke...@hiwaay.net
Whom computers would destroy, they must first drive mad.
Re: well, blew it... sed or perl q again.
On Tue, 30 Dec 2008 12:51:31 -0800, Gary Kline kl...@thought.org wrote:
> All right, then: is this the right syntax?  In other words, do I need
> the double quotes to match the "http:" string?
>
> perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

Close, but not exactly right...  You have to keep in mind that the
argument to -e is a Perl expression, i.e. something you might type as
part of a script that looks like this:

    #!/usr/bin/perl
    while (<STDIN>) {
        YOUR-EXPRESSION-HERE;
        print $_;
    }

One of the ways to print only the lines that do *not* match the
"http://" pattern is:

    print unless (m/http:\/\//);

Note how the '/' characters that are part of the m/.../ expression need
extra backslashes to quote them.  You can avoid this by using another
character for the m/.../ expression delimiter, like:

    print unless (m!http://!);

But you are still not done.  The while loop above already contains a
print statement _outside_ of your expression.  So if you add this to a
perl -p -e '...' invocation you are asking Perl to run this code:

    #!/usr/bin/perl
    while (<STDIN>) {
        print unless (m!http://!);
        print $_;
    }

Each line of input will be printed _anyway_, but you will be
duplicating all the non-http lines.  Use -n instead of -p to fix that:

    perl -n -e 'print unless (m!http://!)'

A tiny detail that may be useful is that "http://" is not required to
be lowercase in URIs.  It may be worth adding the 'i' modifier after
the second '!' of the URI matching expression:

    perl -n -e 'print unless (m!http://!i)'

Once you have that sort-of-working, it may be worth investigating more
elaborate URI matching regexps, because this will match far too much
(including, for instance, all the non-URI lines of this email that
contain the regexp example itself).
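Combining Giorgos's -n fix with Roland's earlier -i suggestion gives a
plausible end-to-end invocation.  This is a sketch only, assuming
junkfoo.com is the unwanted site and .bak backups are wanted:

   perl -ni.bak -e 'print unless m!http://junkfoo\.com!i;' *.html

Every line mentioning the unwanted URL, in any letter case, is dropped;
all other lines are printed back into the file, and the untouched
original survives as *.html.bak.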
Re: well, blew it... sed or perl q again.
Hi Gary,

On Tuesday, 30 Dec 2008, 11:31:14 -0800, Gary Kline wrote:
> The problem is that there are many, _many_ embedded <A
> HREF="http://whatever"> Site</A> in my hundreds, or thousands, of
> files.  I only want to delete the http://junkfoo.com lines, _not_
> the other Href links.
>
> sed or perl?

Ruby.  Untested:

$ ruby -i.bak -pe 'next if ~/href="([^"]*)"/i and $1 == "http://example.com"' somefile.html

Probably you want to do something more sophisticated.

Bertram
-- 
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de
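For readers staying with Perl, a rough equivalent of Bertram's
capture-and-compare idea might look like this (an untested sketch;
example.com again stands in for the real site):

   perl -ni.bak -e 'print unless /href="([^"]*)"/i && $1 eq "http://example.com";' somefile.html

Like the Ruby one-liner, this drops a line only when the captured href
is exactly the unwanted URL, instead of deleting every line that merely
contains http://.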
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 12:51:31PM -0800, Gary Kline wrote:
> 	All right, then: is this the right syntax?  In other words, do
> 	I need the double quotes to match the "http:" string?
>
> 	perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

You don't need the quotes (if the command doesn't contain anything that
your shell would eat/misuse/replace).  See perlop(1).

This will disregard the entire line with a URI in it.  Is this really
what you want?

Copy some of the files you want to scrub to a separate directory, and
run tests to see if your script works:

   mkdir mytest; cp foo mytest/; cd mytest
   perl -pi.bak ../scrub.pl foo
   diff -u foo foo.bak

Roland
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 11:07:05PM +0200, Giorgos Keramidas wrote:
> Note how the '/' characters that are part of the m/.../ expression
> need extra backslashes to quote them.  You can avoid this by using
> another character for the m/.../ expression delimiter, like:
>
>     print unless (m!http://!);

	i've used '%' rather than bangs because i wasn't sure if the
	bang might make the shell have a fit; great to know it
	won't :-)  [i try to avoid escapes when i can.]

> But you are still not done.  The while loop above already contains a
> print statement _outside_ of your expression.  [...]  Each line of
> input will be printed _anyway_, but you will be duplicating all the
> non-http lines.  Use -n instead of -p to fix that:
>
>     perl -n -e 'print unless (m!http://!)'

	ahh, that's what happened last night.  i wound up with dup
	lines (2) pointing me to different links.  had no clue.
	fortunately i had the .bak!

> A tiny detail that may be useful is that "http://" is not required
> to be lowercase in URIs.  It may be worth adding the 'i' modifier
> after the second '!' of the URI matching expression:
>
>     perl -n -e 'print unless (m!http://!i)'
>
> Once you have that sort-of-working, it may be worth investigating
> more elaborate URI matching regexps, because this will match far too
> much (including, for instance, all the non-URI lines of this email
> that contain the regexp example itself).

	i shall try a grep -ri http * > /usr/tmp/g.out and see what
	turns up.

	thanks much,

	gary
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 10:16:42PM +0100, Roland Smith wrote:
> You don't need the quotes (if the command doesn't contain anything
> that your shell would eat/misuse/replace).  See perlop(1).

	i have, thanks; getting more clues... .

> This will disregard the entire line with a URI in it.  Is this
> really what you want?

	exactly; anything that has http://WHATEVER i do not want to
	copy.  the slight gotcha is if the closing </A> tag is on the
	following line.  but in most cases the whole anchor,
	close-anchor of these junk lines is on one line.  ...i know a
	closing tag does nothing; it's just sloppy markup.

> Copy some of the files you want to scrub to a separate directory,
> and run tests to see if your script works:
>
>    mkdir mytest; cp foo mytest/; cd mytest
>    perl -pi.bak ../scrub.pl foo
>    diff -u foo foo.bak

	thanks much to you and giorgos.  i thought about doing this
	by-hand, but only for about 0.01s!

	gary
Re: well, blew it... sed or perl q again.
On Tue, Dec 30, 2008 at 10:16:33PM +0100, Bertram Scharpf wrote:
> Ruby.  Untested:
>
> $ ruby -i.bak -pe 'next if ~/href="([^"]*)"/i and $1 == "http://example.com"' somefile.html
>
> Probably you want to do something more sophisticated.

	Hi Bertram,

	Well, after about 45 minutes of mousing cut/paste/edit, plus
	editing scripts, i ain't there yet.  if i use the

	perl -e 'print unless /m/http:/ || eof; close ARGV if eof' *.htm

	no errors, but the new.htm is == new.htm.bak; in other words,
	it looks like a partial match on just "http" fails.  Don't
	know why.  i'm pretty sure the entire

	<A HREF="http://foobar.com"> xxx </A>

	would do it.  roland, the dbl quotes were necessary, it seems.
	maybe i'll try parens.

	gary
Re: well, blew it... sed or perl q again.
Hi Gary,

On Tuesday, 30 Dec 2008, 17:48:02 -0800, Gary Kline wrote:
> On Tue, Dec 30, 2008 at 10:16:33PM +0100, Bertram Scharpf wrote:
> > Ruby.  Untested:
> >
> > $ ruby -i.bak -pe 'next if ~/href="([^"]*)"/i and $1 == "http://example.com"' somefile.html
>
> no errors, but the new.htm is == new.htm.bak; in other words, it
> looks like a partial match on just "http" fails.  Don't know why.
> i'm pretty sure the entire
>
> <A HREF="http://foobar.com"> xxx </A>
>
> would do it.

This is not FreeBSD-specific, though.

I still wonder why you rely on lines just containing

   %r{^<A.*>.*</A>$}

Maybe you're doing a quick'n'dirty solution, but I'm quite sure you
won't get along with a one-liner.

Bertram