James G. Sack (jim) wrote:
> John H. Robinson, IV wrote:
>> Ralph Shumaker wrote:
>>> I've been compiling my script, using sed to do everything I was
>>> previously doing in vim. However, I've hit a snag. One thing that works
>>> in vim does *not* in sed.
>>>
>>> vim would strip out all unwanted line feeds with:
>>> ":%s/\([ a-zA-Z0-9,\.:;?!)?-]\)\n\([A-Z^a-z(]\)/\1 \2/cg"
>>>
>>> In my script,
>>> "sed -e 's/\([ a-zA-Z0-9,\.:;?!)?-]\)\n\([A-Z^a-z(]\)/\1 \2/g' 0035 >0036"
>>> doesn't change anything, so (as a test) I reduced it down to match one
>>> line in particular:
>>> "sed -e 's/e\nu/e u/g' 0035"
>>> and still no go. But reducing it to:
>>> "sed -e 's/e$/eeeeeee/g' 0035"
>>> or
>>> "sed -e 's/^u/uuuuuuu/g' 0035"
>>> works (except that it does nothing to the newline).
>>>
>>> Any suggestions?
>> Almost sounds like a job for perl. I will have to go back to the
>> original problem to see if a nice, clean perl one-liner can tend to
>> this.
>>
>
> I'm sure jhriv will come up w/ a one-liner for you, but I just wanted to
> remark that your use of \. within character class brackets is not
> required. If vim requires it, then it's vim that's broke.
>
> The usual language is that '.' has no special meaning within brackets.
>
> It's useful to dwell on this a bit -- spending a few minutes here makes
> regular expressions a little less intimidating.
>
> Normally the dot character stands for _any character_ so you can see
> that it wouldn't make much sense to define a character class list that
> contains such a wild card. If '.' means anything, then any other content
> is superfluous!
>
> For further thought, the only specials within brackets ought to be '-'
> (for ranges) and ']' (the end-delimiter for the character class).
> Then you kinda have to add '\' to the specials so that you can write
> '\]' to mean a literal ']'. You can also use '\-' and, of course '\\'.
> By convention, putting the '-' as the first or last character in the
> brackets also means a literal '-'. I suppose putting ']' as the first
> character might logically also mean a literal ']' (since otherwise you
> have an empty class -- maybe it actually works that way?. Arguably,
> using '\-', '\]' is better than remembering additional conventions, but
> I mention it so that you will recognize them when you see them.
>
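
To make the bracket rules above concrete, here is a minimal sketch using
Python's re module. Python is my choice for illustration only; it is not
one of the tools in this thread, and the names below (DOT_PLAIN,
HYPHEN_LAST, matches, and so on) are made up for the example. Python
accepts both the backslash forms and the positional conventions. POSIX
bracket expressions, as used by sed, treat a backslash inside brackets as
a literal character, so there the "']' first" and "'-' first or last"
placements are the portable spellings.

```python
import re

# '.' inside brackets is a literal dot, so escaping it is optional.
DOT_PLAIN = re.compile(r"[.]")
DOT_ESCAPED = re.compile(r"[\.]")

# '-' placed last (or first) in the brackets is a literal hyphen.
HYPHEN_LAST = re.compile(r"[a-c-]")

# ']' placed first in the brackets is a literal ']' in Python's re.
BRACKET_FIRST = re.compile(r"[]x]")

# Python's re also accepts a backslash escape for ']' inside brackets.
BRACKET_ESCAPED = re.compile(r"[\]x]")


def matches(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(matches(DOT_PLAIN, "a.b"))
    print(matches(DOT_ESCAPED, "a.b"))
    print(matches(HYPHEN_LAST, "a-d"))
    print(matches(BRACKET_FIRST, "[x]"))
    print(matches(BRACKET_ESCAPED, "[x]"))
```

Both dot patterns find only the '.', which is the point made above: the
escaped and unescaped forms are equivalent inside brackets, at least in
this engine.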
I should have also mentioned constructs such as '\xHH' for hex values
(and their relatives for octal or decimal), as well as the conventional
names for non-printing characters, e.g., '\n'. One problem is that
different applications use different names and conventions, which adds
some extra challenge to regex use <sigh>.

..j
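
As a rough illustration of the '\xHH' and '\n' escapes, here is a short
sketch in Python's re module (again my choice of tool, not one used in
the thread; HEX_A and join_wrapped_lines are hypothetical names). It
reuses the reduced test pattern 'e\nu' from the original question. It
works here because Python hands the whole text, newlines included, to
the pattern. sed normally reads one line at a time with the trailing
newline removed, so '\n' has nothing to match unless lines are first
joined in the pattern space, for example with the N command.

```python
import re

# '\x41' is the hex escape for the character 'A'.
HEX_A = re.compile(r"\x41")


def join_wrapped_lines(text):
    """Replace each 'e', newline, 'u' sequence with 'e u'.

    This mirrors the reduced test pattern from the question; it is a
    sketch, not the full line-joining expression.
    """
    return re.sub(r"e\nu", "e u", text)


if __name__ == "__main__":
    print(HEX_A.findall("ABA"))
    print(repr(join_wrapped_lines("the\nuser\nok")))
```

Only the newline sitting between 'e' and 'u' is replaced; the other
newline in the sample text is left alone.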
