hi jerry --
In a message dated 4/9/2006 11:06:10 A.M. Eastern Standard Time,
[EMAIL PROTECTED] writes:
> Hi!
>
> I am not sure how to start to use regex in the following.
>
> 2418 23rd, 3/1/cp Storage, $675/mo+Bills, $500/dep, 1500sf, Great Deal! 806-632-4037
> How to pull: 2418 23rd and 3/1/cp and 675 and 806-632-4037?
>
> LOWER PRICE Spacious 3/2/2 5742 36th 1700sf, Nice Neighborhood FP, Appls, $850/mo $400dep 828-1770
> How to pull: 3/2/2 and 5742 36th and 850 and 828-1770?
>
> Remodeled Kitchen, 3/2/1, Rush Area 4812 12 St. 1450sqft, Lrg Yd/Patio $850/mo 763-7677 or 786-4862
> How to pull: 3/2/1 and 4812 12 St and 850 and 763-7677 786-4862?
>
> ……………
>
> 4511 52nd, Lg. 3/2/2 Brick, C-H&A, Fresh Paint, 1600+ sf Great Loc. $850/mo Call 252-6928
> 2507 41st Cute 2/1/1 Big & spacious $650. First Mark 793-8759 / 789-0477
> 3709 24th Close to Hospitals & Tech 3-2 CH&A, Fridge, W/D Included $750 - First Mark 793-8759
> FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. Covered Patio. 308 N. Clinton. 543-6016
>
> --------------------------------------------------------------------------------
>
> Roomy 3/2/2, large closets, 2 living areas, skylights, new carpet, nice kitchen, CHA, 1920sf. 4314 49th $975 793-1712
> 3/1.5/1 House, 2303 84th, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831
> 3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 214-402-4414
>
> I have started with the following:
>
> #!/perl
>
> use strict;
>
> local( *FI, *FO );
>
> my $DEBUG = 00;
> my $DEBUG1 = 01;
>
> open( FI, "RH.lst" ) || die "RH.lst $!\n";
>
> while(<FI>) {
>
> chomp;
>
> print $_;
>
> my $money =~ /\$(\d+)\/mo/;
>
> print "\nmoney $1\n";
>
> my $phone =~ /(\d{3})-(\d+)-(\d+)-/;
>
> print "\nphone $1 * $2 * $3 *\n";
>
> exit;
>
> }
>
> close FI;
>
> Thanks,
>
> Jerry

i'm not sure about the meaning of the 3/2/1 notation - i guess the first
digit is bedrooms, the second baths, but what is the third?
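
a note on the quoted script itself: ``my $money =~ /.../'' declares a new,
undefined variable and matches against that, so the line being read is never
tested and $1 is never set. the phone pattern also insists on a hyphen after
the last group, which none of the sample numbers have, and the ``exit''
inside the loop stops after the first line. the following is only a minimal
sketch of one way to repair that; the sub name money_and_phone is
hypothetical (not in the quoted script), and it assumes phone numbers look
like 828-1770 or 806-632-4037.

```perl
use strict;
use warnings;

# hypothetical helper, not part of the quoted script: pull the monthly
# rent and the first phone number out of one listing line.
sub money_and_phone {
    my ($line) = @_;

    # bind the match to the line itself; list context returns the capture
    my ($money) = $line =~ m{\$(\d+)/mo};

    # ASSUMPTION: optional 3-digit area code, then a 3-4 digit number;
    # no hyphen is required after the last group
    my ($phone) = $line =~ m{((?:\d{3}-)?\d{3}-\d{4})};

    return ($money, $phone);
}

# two entries copied from the quoted message
my @lines = (
    '2418 23rd, 3/1/cp Storage, $675/mo+Bills, $500/dep, 1500sf, Great Deal! 806-632-4037',
    'LOWER PRICE Spacious 3/2/2 5742 36th 1700sf, Nice Neighborhood FP, Appls, $850/mo $400dep 828-1770',
);

for my $line (@lines) {
    my ($money, $phone) = money_and_phone($line);
    printf "money %s  phone %s\n",
        defined $money ? $money : '?',
        defined $phone ? $phone : '?';
}
```

either value comes back undefined when its pattern does not match, so the
caller decides what placeholder to print.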
street names are the trickiest to handle - there are so many variations and
exceptions in the test data you supply. in the examples given below, street
names with two parts (see David Givens example) fail, but all the others
given seem to be extracted pretty well. (note that i have added a few other
test cases.)

i would be interested to know how this code works for you.
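
one possible way to pick up two-part names like ``David Givens'' is to let
the alphabetic part of the street name repeat once. this is only a sketch
under an assumption that is NOT in the script below: a street-name word
starts with a capital letter and a name has at most two such words. the
names $st_word, $st_name2, $street_address2 and street_address_of are
hypothetical; the other patterns are copied from the script.

```perl
use strict;
use warnings;

# building blocks copied from the script in this message
my $st_number  = qr( \d\d+ )x;
my $st_compass = qr( \s+ (?: n|no|north|ne|nw|s|so|south|se|sw|e|east|w|west ) \.? )ix;
my $st_title   = qr( \s+ (?: st | ave | blvd ) \.? )ix;
my $st_ordinal = qr( \s+ \d+ (?: st | nd | rd | th )? )ix;

# ASSUMPTION (not in the original): a name word is capitalized, and a
# street name is made of one or two such words.
my $st_word  = qr( \s+ [A-Z][a-z]+ )x;
my $st_name2 = qr( $st_compass? (?: $st_ordinal | (?: $st_word ){1,2} ) $st_title? )x;

my $street_address2 = qr( (?: (?<= \s ) | ^ ) $st_number $st_name2 )x;

# hypothetical helper: first street address found in an entry, or undef
sub street_address_of {
    my ($entry) = @_;
    my ($address) = $entry =~ m( ($street_address2) )x;
    return $address;
}

my $entry = '3/1.5/1 House, 2303 David Givens, Convenient, Dishwasher, '
          . 'Large Bkyd, NS, $775/mo Call 698-8831';
my $found = street_address_of($entry);
print defined $found ? $found : '?', "\n";
```

the trade-off is that a one-word street name followed directly by another
capitalized word (no comma or period in between) would pick up that extra
word, so this loosens one failure at the risk of another.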
hth -- bill walters
------------------- code begins -------------------------
use strict;
use warnings;

# each newline-delimited line in file is an entry.
# blank lines are ignored.

# regex definitions for components of listing entry.
#
# configuration (number of bedrooms, baths, etc.):
#   may be before or after street address in entry;
#   may be preceded or followed by other info, e.g., ``Storage'',
#     ``Spacious'', ``Lg.'', ``Brick'' which is NOT to be extracted;
#   may be preceded and followed by commas (not to be extracted);
#   may be preceded or followed by street address;
#   configuration delimiter may be slash or dash, e.g., ``3-2'';
#   numeric fields may be 1 or 2 digits;
#   two numeric fields are always present;
#   a third alphanumeric field may be present and is to be extracted.

my $config_delim    = qr([-/]);
my $config_bedrooms = qr( \d )x;              # e.g., 1, 2, 3
my $config_baths    = qr( \d (?: \.5 )? )x;   # e.g., 1, 1.5, 2
my $config_other    = qr( $config_delim \w+ )x;
my $configuration   = qr( $config_bedrooms $config_delim $config_baths $config_other? )x;

# street address (the trickiest one - many, many variations!):
#   may be anywhere in entry;
#   is always preceded by whitespace or beginning of entry string;
#   is ALWAYS numerics followed by alphanumerics with intervening
#     whitespace(s);
#   ...and some other conditions.

my $st_number  = qr( \d\d+ )x;   # NOTE: must be at least two digits!
my $st_compass = qr( \s+ (?: n|no|north|ne|nw|s|so|south|se|sw|e|east|w|west ) \.? )ix;
my $st_title   = qr( \s+ (?: st | ave | blvd ) \.? )ix;
my $st_ordinal = qr( \s+ \d+ (?: st | nd | rd | th )? )ix;
my $st_alpha   = qr( \s+ [a-z]+ )ix;
my $st_name    = qr( $st_compass? (?: $st_ordinal | $st_alpha ) $st_title? )x;
my $street_address = qr( (?: (?<= \s ) | ^ ) $st_number $st_name )x;

# phone number:
#   may or may not have preceding area code in ``321-'' or ``(321)''
#     format;
#   may be present more than once, and all instances must be captured;
#   if present more than once, delimiter(s) between instances is undetermined.

my $area_code = qr[ \( \d\d\d \) \s* | \d\d\d- ]x;
my $exchange  = qr( \d\d\d - \d\d\d\d )x;
my $phone     = qr( $area_code? $exchange )x;

# monthly rental:
#   may NOT be present;
#   if present, is ALWAYS preceded by dollar sign with no intervening space;
#   may be accompanied by a deposit dollar amount, but deposit always follows
#     monthly rental in entry;
#   is to be captured WITHOUT dollar sign.

my $mo_bucks = qr( \d+ )x;
my $mo_cents = qr( \.\d\d )x;   # cents needed?
my $monthly_rental = qr( (?<= \$ ) $mo_bucks $mo_cents? )x;

my $listings_file = shift or die "no listings file given";
open my $listings_fh, '<', $listings_file or die "opening $listings_file: $!";

ENTRY:
while (defined (my $entry = <$listings_fh>)) {

    next ENTRY if $entry =~ / ^ \s* $ /x;   # ignore blank line

    my ($config)  = $entry =~ m( ($configuration) )x;
    my ($rent)    = $entry =~ m( ($monthly_rental) )x;
    my ($address) = $entry =~ m( ($street_address) )x;
    my @phones    = $entry =~ m( $phone )gx;

    # fix up any values that might not be defined in an entry
    $rent   = '?'   unless $rent;
    @phones = ('?') unless @phones;   # can phone be undefined?

    printf "address: %s config.: %s rental: %s phone(s): %s \n",
        $address, $config, $rent, join(', ', @phones);
    print $entry, "\n";

    my $in = <>; last if $in =~ /\S/;   # FOR DEBUG
    }

close $listings_fh or die "closing $listings_file: $!";
-------------------code ends -----------------------------
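
the script above supplies a ``?'' for a missing rent or phone, but $address
and $config go to printf as they are, so an entry without a match for either
would print an empty field and raise an ``uninitialized value'' warning. a
minimal sketch of one way to treat every field the same; the sub name
or_placeholder is hypothetical, and the two patterns are the configuration
and rental patterns from the script written out in one piece.

```perl
use strict;
use warnings;

# hypothetical helper (not in the original script): return the value, or a
# placeholder when the match failed and left it undefined or empty.
sub or_placeholder {
    my ($value, $placeholder) = @_;
    $placeholder = '?' unless defined $placeholder;
    return (defined $value && length $value) ? $value : $placeholder;
}

# configuration and monthly rental patterns, as in the script
my $configuration  = qr( \d [-/] \d (?: \.5 )? (?: [-/] \w+ )? )x;
my $monthly_rental = qr( (?<= \$ ) \d+ (?: \.\d\d )? )x;

# an entry from the test data that has no rent at all
my $entry = 'FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. '
          . 'Covered Patio. 308 N. Clinton. 543-6016';

my ($config) = $entry =~ m( ($configuration) )x;
my ($rent)   = $entry =~ m( ($monthly_rental) )x;

printf "config.: %s  rental: %s\n",
    or_placeholder($config), or_placeholder($rent);
```

for this entry the sketch prints ``config.: 3/3/2  rental: ?''. unlike
``unless $rent'', the defined-and-length test would keep a rent of 0 if one
ever appeared.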
------------------- test data begins ---------------------
2418 23rd, 3/1/cp Storage, $675/mo+Bills, $500/dep, 1500sf, Great Deal! 806-632-4037
LOWER PRICE Spacious 3/2/2 5742 36th 1700sf, Nice Neighborhood FP, Appls, $850/mo $400dep 828-1770
Remodeled Kitchen, 3/2/1, Rush Area 4812 12 St. 1450sqft, Lrg Yd/Patio $850/mo 763-7677 or 786-4862
4511 52nd, Lg. 3/2/2 Brick, C-H&A, Fresh Paint, 1600+ sf Great Loc. $850/mo Call 252-6928
4511 West 52nd, Lg. 3/2/2 Brick, C-H&A, Fresh Paint, 1600+ sf Great Loc. $850/mo Call 252-6928
2507 41st Cute 2/1/1 Big & spacious $650. First Mark 793-8759 / 789-0477
3709 24th Close to Hospitals & Tech 3-2 CH&A, Fridge, W/D Included $750 - First Mark 793-8759
FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. Covered Patio. 308 N. Clinton. 543-6016
Roomy 3/2/2, large closets, 2 living areas, skylights, new carpet, nice kitchen, CHA, 1920sf. 4314 49th $975 793-1712
3/1.5/1 House, 2303 84th, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831
3/1.5/1 House, 2303 David Givens, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831 ADDRESS FAILS!!!
3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 214-402-4414
3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 (214) 402-4414, 321-9876
3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 (214)402-4414
-------------------test data ends -------------------------
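
the entries with more than one phone number rely on a /g match in list
context, which returns every match of the pattern rather than just the
first. a small sketch of that one step, run against a few of the test
entries; the sub name phones_in is hypothetical, and the patterns are the
ones from the script (the optional area code is grouped explicitly here).

```perl
use strict;
use warnings;

# phone patterns as in the script
my $area_code = qr[ \( \d\d\d \) \s* | \d\d\d- ]x;
my $exchange  = qr( \d\d\d - \d\d\d\d )x;
my $phone     = qr( (?: $area_code )? $exchange )x;

# hypothetical helper: every phone number in an entry, in order.
# the pattern has no capture groups, so /g in list context returns
# each whole match.
sub phones_in {
    my ($entry) = @_;
    my @phones = $entry =~ m( $phone )gx;
    return @phones;
}

my $entry = '3007 30th for Rent in Tech Terrace 3/2, $1200/mo '
          . 'Available 6/1/06 (214) 402-4414, 321-9876';
print join(', ', phones_in($entry)), "\n";
```

for this entry the sketch prints ``(214) 402-4414, 321-9876''; the date
6/1/06 and the rent are skipped because neither has the 3-digit, hyphen,
4-digit shape.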
