Re: well, blew it... sed or perl q again.

2008-12-31 Thread Karl Vogel
 On Tue, 30 Dec 2008 11:31:14 -0800, 
 Gary Kline kl...@thought.org said:

G The problem is that there are many, _many_ embedded <A
G HREF="http://whatever"> Site</A> in my hundreds, or thousands, of
G files.  I only want to delete the http://junkfoo.com lines, _not_
G the other Href links.

   Use perl.  You'll want the /i modifier to do case-insensitive matching,
   plus /s and /m for matches that can span multiple lines; the first
   quoted line above shows one of several places where a URL can cross
   a line-break.

   You might want to leave the originals completely alone.  I never trust
   programs to modify files in place:

 you% mkdir /tmp/work
 you% find . -type f -print | xargs grep -li http://junkfoo.com > FILES
 you% pax -rwdv -pe /tmp/work < FILES

   Your perl script can just read FILES and overwrite the stuff in the new
   directory.  You'll want to slurp the entire file into memory so you catch
   any URL that spans multiple lines.  Try the script below; it works for
   input like this:

  This
  <a HREF="http://junkfoo.com">
 Site</a> should go away too.

  And so should
  <a HREF=
"http://junkfoo.com/">
   Site</a> this

  And finally <a HREF="http://junkfoo.com/">Site</a> this

-- 
Karl Vogel  I don't speak for the USAF or my company

The average person falls asleep in seven minutes.
--item for a lull in conversation

---
#!/usr/bin/perl -w

use strict;

my $URL = 'href\s*=\s*"\s*http://junkfoo\.com/?"';
my $contents;
my $fh;
my $infile;
my $outfile;

while (<>) {                 # filenames come from FILES (stdin or argument)
    chomp;
    $infile = $_;

    s{^\./}{/tmp/work/};
    $outfile = $_;

    open ($fh, '<', $infile) or die "$infile: $!";
    $contents = do { local $/; <$fh> };    # slurp the whole file
    close ($fh);

    $contents =~ s{          # substitute ...
        <a (.*?)             # ... URL start
        $URL                 # ... actual link
        (.*?)                # ... min # of chars including newline
        </a>                 # ... until we end
      }
      { }gixms;              # ... with a single space

    open ($fh, '>', $outfile) or die "$outfile: $!";
    print $fh $contents;
    close ($fh);
}

exit(0);
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to freebsd-questions-unsubscr...@freebsd.org


Re: well, blew it... sed or perl q again.

2008-12-31 Thread Gary Kline
On Wed, Dec 31, 2008 at 03:20:14PM -0500, Karl Vogel wrote:
  On Tue, 30 Dec 2008 11:31:14 -0800, 
  Gary Kline kl...@thought.org said:
 
 G The problem is that there are many, _many_ embedded <A
 G HREF="http://whatever"> Site</A> in my hundreds, or thousands, of
 G files.  I only want to delete the http://junkfoo.com lines, _not_
 G the other Href links.
 
Use perl.  You'll want the /i modifier to do case-insensitive matching,
plus /s and /m for matches that can span multiple lines; the first
quoted line above shows one of several places where a URL can cross
a line-break.
 
You might want to leave the originals completely alone.  I never trust
programs to modify files in place:
 
  you% mkdir /tmp/work
  you% find . -type f -print | xargs grep -li http://junkfoo.com > FILES
  you% pax -rwdv -pe /tmp/work < FILES
^^^

pax is like cpio, isn't it?

anyway, yes, i'll ponder this.  i [mis]-spent hours
undoing something bizarre that my scrub.c binary did
to directories, turning foo and bar (and scores more)
into foo and foo.bar, bar and bar.bak.  the bak were
the saved directories.  the foo, bar were bizarre; i
couldn't write/cp/mv over them.  had to carefully
rm -f foo; mv foo.bar foo [et cetera].

then i scp'd my files to two other computers.
(*mumble*)

 

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
The 2.17a release of Jottings: http://jottings.thought.org/index.php



Re: well, blew it... sed or perl q again.

2008-12-30 Thread Roland Smith
On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
   The problem is that there are many, _many_ embedded 
   <A HREF="http://whatever"> Site</A> in my hundreds, or
   thousands, of files.  I only want to delete the
   http://junkfoo.com lines, _not_ the other Href links.
 
   Which would be best to use, given that a backup is critical?
   sed or perl?

IMHO, perl with the -i option to do in-place editing with backups. You
could also use the -p option to loop over files. See perlrun(1).

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)




Re: well, blew it... sed or perl q again.

2008-12-30 Thread Gary Kline
On Tue, Dec 30, 2008 at 09:16:23PM +0100, Roland Smith wrote:
 On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
  The problem is that there are many, _many_ embedded 
  <A HREF="http://whatever"> Site</A> in my hundreds, or
  thousands, of files.  I only want to delete the
  http://junkfoo.com lines, _not_ the other Href links.
  
  Which would be best to use, given that a backup is critical?
  sed or perl?
 
 IMHO, perl with the -i option to do in-place editing with backups. You
 could also use the -p option to loop over files. See perlrun(1).
 
 Roland


All right, then: is this the right syntax?  In other words, do
I need the double quotes to match the http: string?

  perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

gary





-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
The 2.17a release of Jottings: http://jottings.thought.org/index.php



Re: well, blew it... sed or perl q again.

2008-12-30 Thread David Kelly
On Tue, Dec 30, 2008 at 12:51:31PM -0800, Gary Kline wrote:
 On Tue, Dec 30, 2008 at 09:16:23PM +0100, Roland Smith wrote:
  On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
 The problem is that there are many, _many_ embedded 
 <A HREF="http://whatever"> Site</A> in my hundreds, or
 thousands, of files.  I only want to delete the
 http://junkfoo.com lines, _not_ the other Href links.
   
 Which would be best to use, given that a backup is critical?
 sed or perl?
  
  IMHO, perl with the -i option to do in-place editing with backups. You
  could also use the -p option to loop over files. See perlrun(1).
  
  Roland
 
 
   All right, then is this the right syntax.  In other words, do
   I need the double quotes to match the http: string?
 
   perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

In years past I used fetch(1) to download the day's page from a comic
strip site, awk to extract the URL of the day's comic strip, and fetch
again to put a copy of the comic strip in my archive. This application
sounds similar.

-- 
David Kelly N4HHE, dke...@hiwaay.net

Whom computers would destroy, they must first drive mad.


Re: well, blew it... sed or perl q again.

2008-12-30 Thread Giorgos Keramidas
On Tue, 30 Dec 2008 12:51:31 -0800, Gary Kline kl...@thought.org wrote:
   All right, then is this the right syntax.  In other words, do
   I need the double quotes to match the http: string?

   perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

Close, but not exactly right...

You have to keep in mind that the argument to -e is a Perl expression,
i.e. something you might type as part of a script that looks like this:

#!/usr/bin/perl

while (<STDIN>) {
    YOUR-EXPRESSION-HERE;
    print $_;
}

One of the ways to print only the lines that do *not* match the
"http://" pattern is:

print unless (m/http:\/\//);

Note how the '/' characters that are part of the m/.../ expression need
extra backslashes to quote them.  You can avoid this by using another
character for the m/.../ expression delimiter, like:

print unless (m!http://!);

But you are still not done.  The while loop above already contains a
print statement _outside_ of your expression.  So if you add this to a
perl -p -e '...' invocation you are asking Perl to run this code:

#!/usr/bin/perl

while (<STDIN>) {
    print unless (m!http://!);
    print $_;
}

Each line of input will be printed _anyway_, but you will be duplicating
all the non-http lines.  Use -n instead of -p to fix that:

perl -n -e 'print unless (m!http://!)'

A tiny detail that may be useful is that "http://" is not required to be
lowercase in URIs.  It may be worth adding the 'i' modifier after the
second '!' of the URI matching expression:

perl -n -e 'print unless (m!http://!i)'

Once you have that sort-of-working, it may be worth investigating more
elaborate URI matching regexps, because this will match far too much
(including, for instance, all the non-URI lines of this email that
contain the regexp example itself).



Re: well, blew it... sed or perl q again.

2008-12-30 Thread Bertram Scharpf
Hi Gary,

Am Dienstag, 30. Dez 2008, 11:31:14 -0800 schrieb Gary Kline:
 The problem is that there are many, _many_ embedded 
 <A HREF="http://whatever"> Site</A> in my hundreds, or
 thousands, of files.  I only want to delete the
 http://junkfoo.com lines, _not_ the other Href links.

 sed or perl?

Ruby. Untested:

  $ ruby -i.bak -pe 'next if ~/href="([^"]*)"/i and $1 == "http://example.com"' 
somefile.html

Probably you want to do something more sophisticated.

Bertram


-- 
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de


Re: well, blew it... sed or perl q again.

2008-12-30 Thread Roland Smith
On Tue, Dec 30, 2008 at 12:51:31PM -0800, Gary Kline wrote:
 On Tue, Dec 30, 2008 at 09:16:23PM +0100, Roland Smith wrote:
  On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
 The problem is that there are many, _many_ embedded 
 <A HREF="http://whatever"> Site</A> in my hundreds, or
 thousands, of files.  I only want to delete the
 http://junkfoo.com lines, _not_ the other Href links.
   
 Which would be best to use, given that a backup is critical?
 sed or perl?
  
  IMHO, perl with the -i option to do in-place editing with backups. You
  could also use the -p option to loop over files. See perlrun(1).
  
  Roland
 
 
   All right, then is this the right syntax.  In other words, do
   I need the double quotes to match the http: string?
 
   perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *

You don't need the quotes (if the command doesn't contain anything that
your shell would eat/misuse/replace). See perlop(1).  

This will disregard the entire line with a URI in it. Is this really
what you want?

Copy some of the files you want to scrub to a separate directory, and
run tests to see if your script works:

  mkdir mytest; cp foo mytest/; cd mytest; perl -pi.bak ../scrub.pl foo
  diff -u foo foo.bak 

Roland
-- 
R.F.Smith   http://www.xs4all.nl/~rsmith/
[plain text _non-HTML_ PGP/GnuPG encrypted/signed email much appreciated]
pgp: 1A2B 477F 9970 BA3C 2914  B7CE 1277 EFB0 C321 A725 (KeyID: C321A725)




Re: well, blew it... sed or perl q again.

2008-12-30 Thread Gary Kline
On Tue, Dec 30, 2008 at 11:07:05PM +0200, Giorgos Keramidas wrote:
 On Tue, 30 Dec 2008 12:51:31 -0800, Gary Kline kl...@thought.org wrote:
  All right, then is this the right syntax.  In other words, do
  I need the double quotes to match the http: string?
 
perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *
 
 Close, but not exactly right...
 
 You have to keep in mind that the argument to -e is a Perl expression,
 i.e. something you might type as part of a script that looks like this:
 
 #!/usr/bin/perl
 
 while (<STDIN>) {
     YOUR-EXPRESSION-HERE;
     print $_;
 }
 
 One of the ways to print only the lines that do *not* match the
 "http://" pattern is:
 
 print unless (m/http:\/\//);
 
 Note how the '/' characters that are part of the m/.../ expression need
 extra backslashes to quote them.  You can avoid this by using another
 character for the m/.../ expression delimiter, like:


i've used '%' rather than bangs because i wasn't sure if the
bang might make the shell have a fit; great to know it
won't :-)   [i try to avoid escapes when i can... .]


 
 print unless (m!http://!);
 
 But you are still not done.  The while loop above already contains a
 print statement _outside_ of your expression.  So if you add this to a
 perl -p -e '...' invocation you are asking Perl to run this code:
 
 #!/usr/bin/perl
 
 while (<STDIN>) {
     print unless (m!http://!);
     print $_;
 }
 
 Each line of input will be printed _anyway_, but you will be duplicating
 all the non-http lines.  Use -n instead of -p to fix that:
 
 perl -n -e 'print unless (m!http://!)'
 


ahhhm, that's what happened last night.  i wound up with dup
lines (2) pointing me to different links.  had no clue.
fortunately i had the .bak!


 A tiny detail that may be useful is that "http://" is not required to be
 lowercase in URIs.  It may be worth adding the 'i' modifier after the
 second '!' of the URI matching expression:
 
 perl -n -e 'print unless (m!http://!i)'
 
 Once you have that sort-of-working, it may be worth investigating more
 elaborate URI matching regexps, because this will match far too much
 (including, for instance, all the non-URI lines of this email that
 contain the regexp example itself).

i shall try a grep -ri http * > /usr/tmp/g.out
and see what turns up.  thanks much,

gary

 

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
The 2.17a release of Jottings: http://jottings.thought.org/index.php



Re: well, blew it... sed or perl q again.

2008-12-30 Thread Gary Kline
On Tue, Dec 30, 2008 at 10:16:42PM +0100, Roland Smith wrote:
 On Tue, Dec 30, 2008 at 12:51:31PM -0800, Gary Kline wrote:
  On Tue, Dec 30, 2008 at 09:16:23PM +0100, Roland Smith wrote:
   On Tue, Dec 30, 2008 at 11:31:14AM -0800, Gary Kline wrote:
The problem is that there are many, _many_ embedded 
<A HREF="http://whatever"> Site</A> in my hundreds, or
thousands, of files.  I only want to delete the
http://junkfoo.com lines, _not_ the other Href links.

Which would be best to use, given that a backup is critical?
sed or perl?
   
   IMHO, perl with the -i option to do in-place editing with backups. You
   could also use the -p option to loop over files. See perlrun(1).
   
   Roland
  
  
  All right, then is this the right syntax.  In other words, do
  I need the double quotes to match the http: string?
  
perl -pi.bak -e 'print unless /m/http:/ || eof; close ARGV if eof' *
 
 You don't need the quotes (if the command doesn't contain anything that
 your shell would eat/misuse/replace). See perlop(1).  


i have, thanks; getting more clues... .
 
 This will disregard the entire line with a URI in it. Is this really
 what you want?

exactly; anything that has http://WHATEVER i do not want to
copy.  the slight gotcha is if the </A> tag is on the
following line.  but in most cases the whole anchor,
close-anchor of these junk lines is on one line.   ...i know
a closing tag does nothing; it's just sloppy markup.


 
 Copy some of the files you want to scrub to a separate directory, and
 run tests to see if your script works:
 
   mkdir mytest; cp foo mytest/; cd mytest; perl -pi.bak ../scrub.pl foo
   diff -u foo foo.bak 

thanks much to you and giorgos.  i thought about doing this
by-hand, but only for about 0.01s!

gary


 
 Roland



-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
The 2.17a release of Jottings: http://jottings.thought.org/index.php



Re: well, blew it... sed or perl q again.

2008-12-30 Thread Gary Kline
On Tue, Dec 30, 2008 at 10:16:33PM +0100, Bertram Scharpf wrote:
 Hi Gary,
 
 Am Dienstag, 30. Dez 2008, 11:31:14 -0800 schrieb Gary Kline:
  The problem is that there are many, _many_ embedded 
  <A HREF="http://whatever"> Site</A> in my hundreds, or
  thousands, of files.  I only want to delete the
  http://junkfoo.com lines, _not_ the other Href links.
 
  sed or perl?
 
 Ruby. Untested:
 
   $ ruby -i.bak -pe 'next if ~/href="([^"]*)"/i and $1 == 
 "http://example.com"' somefile.html
 
 Probably you want to do something more sophisticated.
 
 Bertram
 

Hi Bertram,

Well, after about 45 minutes of mousing cut/paste/edit, plus
editing scripts, i ain't there yet.  if i use the 

   perl -e 'print unless /m/http:/ || eof; close ARGV if eof' *.htm

no errors, but the new.htm is == new.htm.bak; in other words,
it looks like a partial match on just http fails.  Don't
know why.  i'm pretty sure the entire <A HREF="http://foobar.com"> xxx
</A>
would do it.

roland, the double quotes were necessary, it seems.  maybe i'll
try parens.

gary



 

-- 
 Gary Kline  kl...@thought.org  http://www.thought.org  Public Service Unix
http://jottings.thought.org   http://transfinite.thought.org
The 2.17a release of Jottings: http://jottings.thought.org/index.php



Re: well, blew it... sed or perl q again.

2008-12-30 Thread Bertram Scharpf
Hi Gary,

Am Dienstag, 30. Dez 2008, 17:48:02 -0800 schrieb Gary Kline:
 On Tue, Dec 30, 2008 at 10:16:33PM +0100, Bertram Scharpf wrote:
  Hi Gary,
  
  Am Dienstag, 30. Dez 2008, 11:31:14 -0800 schrieb Gary Kline:
   The problem is that there are many, _many_ embedded 
   <A HREF="http://whatever"> Site</A> in my hundreds, or
   thousands, of files.  I only want to delete the
   http://junkfoo.com lines, _not_ the other Href links.
  
   sed or perl?
  
  Ruby. Untested:
  
    $ ruby -i.bak -pe 'next if ~/href="([^"]*)"/i and $1 == 
  "http://example.com"' somefile.html
  
  Probably you want to do something more sophisticated.
 
   no errors, but the new.htm is == new.htm.bak; in other words,
   it looks like a partial match on just http fails.  Don't
   know why.  i'm pretty sure the entire <A HREF="http://foobar.com"> xxx
 </A>
   would do it.  

This is not FreeBSD-specific, though.

I still wonder why you rely on lines just containing
%r{^<A.*>.*</A>$}.  Maybe you're doing a quick'n'dirty solution,
but I'm quite sure you won't get far with a one-liner.

Bertram


-- 
Bertram Scharpf
Stuttgart, Deutschland/Germany
http://www.bertram-scharpf.de