hi guys --   

In a message dated 5/15/2009 8:55:30 PM Eastern Standard Time, 
williamawalt...@aol.com writes:

> In a message dated 5/15/2009 6:20:40 PM Eastern Standard Time, 
ari.constan...@gmail.com writes: 
> 
> > On Fri, May 15, 2009 at 11:18 PM, Barry Brevik <bbre...@stellarmicro.com> wrote: 
> > 
> > > I am running Active Perl 5.8.8. 
> > > ... 
> > > Difficulty: the fields contain hundreds of words both preceding and 
> > > following the "bad" words, so I have to be able to pick out the 
> > > lower-case words that contain one embedded upper-case character. 
> > > ... 
> > > Barry Brevik 
> > 
> > Hi Barry, 
> > 
> > Maybe something like this would help: 
> > 
> > $ cat test.txt 
> > madeStyle 
> > facilitatedOne 
> > Anti-magneticQuality 
> > 
> > $ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g' 
> > made. Style 
> > facilitated. One 
> > Anti-magnetic. Quality 
> > 
> > Regards, Ari Constancio 
> 
> ...
>
> a better approach might be something like:    
> 
> >cat test.txt | perl -wMstrict -pe 
> "s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg" 
> made. Style 
> facilitated. One 
> Anti-magnetic. Quality 
> 123FOO 
> 
> hth -- bill walters    

well, english is a complicated thing, as, i guess, are all natural 
languages.   

it occurred to me that the solution i suggested, which assumes that a new 
sentence begins with a uc letter followed by at least one lc letter (that 
was how i interpreted the original 'lower-case words that contain one 
embedded upper-case character' spec), fails for a very common kind of 
word: a one-letter word such as 'A' can never start a sentence under that 
rule (see 'the endA new' in the test data below).   the approach below 
makes separate regex definitions for the end-of-sentence and 
beginning-of-sentence patterns; these are more easily adapted as 
requirements mature.   

of course, the new approach fails for BiCapitalized words.   sigh.   
using separate regex definitions might come into play here: one might, 
for instance, define a list of bi-capitalized words that would be used with 
a look-around to avoid improper substitutions.   

(i cannot think of a case in which a proper sentence ends with 
anything other than an lc letter before the period.   if there is such, 
the separate regex approach could, i think, be easily adapted to handle 
it.)   

>cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
123FOO
the endA new
PowerPoint

>cat test.txt | perl -wMstrict -pe
"INIT {
   my $sen_end = qr{ [[:lower:]] }xms;
   my $new_sen = qr{ [[:upper:]] }xms;
   sub S { s{ ($sen_end) ($new_sen) }{$1. $2}xmsg }
   }
 S;
"
made. Style
facilitated. One
Anti-magnetic. Quality
123FOO
the end. A new
Power. Point
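
the bi-capitalized exception list mentioned above could be sketched 
roughly as follows.   this is only a sketch under assumptions: the word 
list @bicaps and the sub name split_sentences are made up for 
illustration, and instead of a look-behind (which must be fixed-width in 
perl 5.8) it uses an alternation that lets listed words pass through 
unchanged, plus a look-ahead so the upper-case letter is not consumed and 
a listed word glued to the end of a sentence can still be recognized.   

```perl
use strict;
use warnings;

# ASSUMPTION: a hypothetical list of bi-capitalized words to leave intact.
my @bicaps = qw(PowerPoint ActiveState);

# Longest first, so a longer listed word wins over a shorter prefix.
my $alt   = join '|', map { quotemeta } sort { length $b <=> length $a } @bicaps;
my $bicap = qr{ (?: $alt ) }xms;

# Same end-of-sentence / start-of-sentence patterns as in the one-liner.
my $sen_end = qr{ [[:lower:]] }xms;
my $new_sen = qr{ [[:upper:]] }xms;

# Hypothetical helper: returns a copy of $text with ". " inserted at each
# lower-to-upper boundary, except inside a listed bi-capitalized word.
sub split_sentences {
    my ($text) = @_;
    # Branch 1 matches a listed word and puts it back unchanged.
    # Branch 2 matches a sentence end only when a sentence start follows;
    # the look-ahead leaves the upper-case letter for the next match attempt.
    $text =~ s{ ($bicap) | ($sen_end) (?= $new_sen) }
              { defined $1 ? $1 : "$2. " }xmsge;
    return $text;
}

for my $line ('madeStyle', 'the endA new', 'PowerPoint', 'endPowerPoint is') {
    print split_sentences($line), "\n";
}
```

with the sample lines above this prints 'made. Style', 'the end. A new', 
'PowerPoint' and 'end. PowerPoint is'.   any bi-capitalized word missing 
from the list is still split, so the list has to be maintained by hand.   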

again, hth -- bill walters   
_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
