hi jerry --  
 
In a message dated 4/9/2006 11:06:10 A.M. Eastern Standard Time, [EMAIL PROTECTED] writes:
 
> Hi!
>
> I am not sure how to start to use regex in the following. 
>
> 2418 23rd, 3/1/cp Storage, $675/mo+Bills, $500/dep, 1500sf, Great Deal! 806-632-4037
> How to pull:  2418 23rd  and 3/1/cp and 675 and 806-632-4037?
>
> LOWER PRICE Spacious 3/2/2 5742 36th 1700sf, Nice Neighborhood FP, Appls, $850/mo $400dep 828-1770
> How to pull: 3/2/2  and  5742 36th and 850 and 828-1770?
>
> Remodeled Kitchen, 3/2/1, Rush Area 4812 12 St. 1450sqft, Lrg Yd/Patio $850/mo 763-7677 or 786-4862
> How to pull: 3/2/1 and 4812 12 St and 850 and 763-7677 786-4862?
>
> ……………
>
> 4511 52nd, Lg. 3/2/2 Brick, C-H&A, Fresh Paint, 1600+ sf Great Loc. $850/mo Call 252-6928
> 2507 41st Cute 2/1/1 Big & spacious $650. First Mark 793-8759 / 789-0477
> 3709 24th Close to Hospitals & Tech 3-2 CH&A, Fridge, W/D Included $750 - First Mark 793-8759
> FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. Covered Patio. 308 N. Clinton. 543-6016
>
> --------------------------------------------------------------------------------
>
> Roomy 3/2/2, large closets, 2 living areas, skylights, new carpet, nice kitchen, CHA, 1920sf. 4314 49th $975 793-1712
> 3/1.5/1 House, 2303 84th, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831
> 3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 214-402-4414
>
> I have started with the following:
>
> #!/perl
>
>   use strict;
>
>   local( *FI, *FO );
>
>   my $DEBUG           = 00;
>   my $DEBUG1          = 01;
>
>   open( FI, "RH.lst" ) || die "RH.lst $!\n";
>
>   while(<FI>) {
>
>     chomp;
>
>     print $_;
>
>     my $money =~ /\$(\d+)\/mo/;
>
>     print "\nmoney $1\n";
>
>     my $phone =~ /(\d{3})-(\d+)-(\d+)-/;
>
>     print "\nphone $1 * $2 * $3 *\n";
>
> exit;
>
>   }
>
>   close FI;
>
> Thanks,
>
> Jerry
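
one note on the code you started with, before getting to my version: a statement like  my $money =~ /.../  declares a brand-new, undefined $money and matches against that, not against the line just read, so $1 is never set from the entry.  the phone pattern also demands a dash after the last group of digits, which none of the entries have.  here is a minimal sketch of one way to capture both values; $entry is just my stand-in for a line read from RH.lst, and the phone pattern is a simplified assumption (optional area code, then 3 and 4 digits).

```perl
use strict;
use warnings;

# one of the sample entries, standing in for a line read from RH.lst
my $entry = '2418 23rd, 3/1/cp Storage, $675/mo+Bills, $500/dep, 1500sf, Great Deal! 806-632-4037';

# match against the entry itself and capture in list context;
# $money stays undefined if the pattern does not match
my ($money) = $entry =~ m{ \$ (\d+) /mo }x;

# ASSUMPTION: a phone number is an optional 3-digit area code and dash,
# then 3 digits, a dash and 4 digits; no trailing dash is required
my ($phone) = $entry =~ m{ ( (?: \d{3} - )? \d{3} - \d{4} ) }x;

print "money: ", (defined $money ? $money : '?'), "\n";
print "phone: ", (defined $phone ? $phone : '?'), "\n";
```

capturing with  my ($x) = $string =~ /(...)/  also avoids printing a stale $1 left over from some earlier successful match.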
 
i'm not sure about the meaning of the 3/2/1 notation - i guess the first digit is bedrooms,
the second baths, but what is the third?  
 
street addresses are the trickiest to handle - there are so many variations and exceptions
in the test data you supply.  in the test data given below, street names made of two words
(see the David Givens example) fail: only the first word of the name is captured.  all the
other addresses given seem to be extracted pretty well.  
 
(note that i have added a few other test cases.)  
 
i would be interested to know how this code works for you.  
 
hth -- bill walters  
 
 
------------------- code begins -------------------------
 
use strict;
use warnings;
 

# each newline-delimited line in file is an entry.
# blank lines are ignored.
 
# regex definitions for components of listing entry.
#
# configuration (number of bedrooms, baths, etc.):
#   may be before or after street address in entry;
#   may be preceded or followed by other info, e.g., ``Storage'',
#   ``Spacious'', ``Lg.'', ``Brick'' which is NOT to be extracted;
#   may be preceded and followed by commas (not to be extracted);
#   configuration delimiter may be slash or dash, e.g., ``3-2'';
#   bedrooms field is a single digit; baths field is a single digit
#   optionally followed by ``.5'', e.g., ``1.5'';
#   these two numeric fields are always present;
#   a third alphanumeric field may be present and is to be extracted.
 
my $config_delim    = qr([-/]);
my $config_bedrooms = qr( \d )x;  # e.g., 1, 2, 3
my $config_baths    = qr( \d (?: \.5 )? )x;  # e.g., 1, 1.5, 2
my $config_other    = qr( $config_delim \w+ )x;
my $configuration
    = qr( $config_bedrooms $config_delim $config_baths  $config_other? )x;
 
# street address (the trickiest one - many, many variations!):
#   may be anywhere in entry;
#   is always preceded by whitespace or beginning of entry string;
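#   is ALWAYS a number of two or more digits followed, after whitespace,
#   by an ordinal (``23rd'', ``12'') or a single alphabetic word;
#   may have a compass prefix (``N.'', ``West'') before the name and
#   a title (``St.'', ``Ave'', ``Blvd'') after it.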
 
my $st_number  = qr( \d\d+ )x;  # NOTE: must be at least two digits!
my $st_compass = qr( \s+ (?: n|no|north|ne|nw|s|so|south|se|sw|e|east|w|west ) \.? )ix;
my $st_title   = qr( \s+ (?: st | ave | blvd ) \.? )ix;
my $st_ordinal = qr( \s+ \d+ (?: st | nd | rd | th )? )ix;
my $st_alpha   = qr( \s+ [a-z]+ )ix;
my $st_name    = qr( $st_compass? (?: $st_ordinal | $st_alpha ) $st_title? )x;
my $street_address = qr( (?: (?<= \s ) | ^ ) $st_number $st_name )x;
 
# phone number:
#   may or may not have preceding area code in ``321-'' or ``(321)''
#   format;
#   may be present more than once, and all instances must be captured;
#   if present more than once, delimiter(s) between instances is undetermined.
 
my $area_code = qr[ \( \d\d\d \) \s* | \d\d\d- ]x;
my $exchange  = qr( \d\d\d - \d\d\d\d )x;
my $phone = qr( $area_code? $exchange )x;
 
# monthly rental:
#   may NOT be present;
#   if present, is ALWAYS preceded by dollar sign with no intervening space;
#   may be accompanied by a deposit dollar amount, but deposit always follows
#   monthly rental in entry;
#   is to be captured WITHOUT dollar sign.
 
my $mo_bucks = qr( \d+ )x;
my $mo_cents = qr( \.\d\d )x;  # cents needed?
my $monthly_rental = qr( (?<= \$ ) $mo_bucks $mo_cents? )x;
 

my $listings_file = shift
    or die "no listings file given";
 

open my $listings_fh, '<', $listings_file or die "opening $listings_file: $!";
 
ENTRY:
while (defined (my $entry = <$listings_fh>)) {
 
    next ENTRY if $entry =~ / ^ \s* $ /x;  # ignore blank line
 
    my ($config)  = $entry =~ m( ($configuration)  )x;
    my ($rent)    = $entry =~ m( ($monthly_rental) )x;
    my ($address) = $entry =~ m( ($street_address) )x;
    my  @phones   = $entry =~ m(  $phone           )gx;
 
    # fix up any values that might not be defined in an entry
    $config  =  '?'  unless defined $config;
    $rent    =  '?'  unless defined $rent;
    $address =  '?'  unless defined $address;
    @phones  = ('?') unless @phones;  # can phone be undefined?
 
    printf "address: %s  config.: %s  rental: %s  phone(s): %s \n",
            $address, $config, $rent, join(', ', @phones);
 
    # FOR DEBUG: show the entry, then wait for Enter; any other input stops
    print $entry, "\n";
    my $in = <STDIN>;
    last ENTRY if !defined $in or $in =~ /\S/;
 
    }
 
close $listings_fh or die "closing $listings_file: $!";
------------------- code ends ----------------------------
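
if stepping through the file by hand gets tedious, the patterns can also be checked against a few entries with known answers.  the following is only a sketch: it copies the phone and rental patterns from the code above, and the @cases list with its expected values is something i worked out by hand from the test data, not part of the program.

```perl
use strict;
use warnings;

# copies of two of the patterns defined in the code above
my $area_code = qr[ \( \d\d\d \) \s* | \d\d\d- ]x;
my $exchange  = qr( \d\d\d - \d\d\d\d )x;
my $phone     = qr( $area_code? $exchange )x;
my $monthly_rental = qr( (?<= \$ ) \d+ (?: \.\d\d )? )x;

# each case: entry, expected rent, expected phone list.
# ASSUMPTION: the expected values are hand-derived from the entries.
my @cases = (
    [ '2507 41st Cute 2/1/1 Big & spacious $650. First Mark 793-8759 / 789-0477',
      '650', '793-8759, 789-0477' ],
    [ '3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 (214) 402-4414, 321-9876',
      '1200', '(214) 402-4414, 321-9876' ],
    [ 'FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. Covered Patio. 308 N. Clinton. 543-6016',
      '?', '543-6016' ],
);

my $failures = 0;

for my $case (@cases) {
    my ($entry, $want_rent, $want_phones) = @$case;

    my ($rent) = $entry =~ m( ($monthly_rental) )x;
    my @phones = $entry =~ m(  $phone           )gx;

    # same fix-ups as the main loop: '?' marks a missing value
    $rent = '?' unless defined $rent;
    my $got_phones = @phones ? join(', ', @phones) : '?';

    my $ok = $rent eq $want_rent && $got_phones eq $want_phones;
    $failures++ unless $ok;

    printf "%-4s rent: %-5s phone(s): %s\n",
           ($ok ? 'ok' : 'FAIL'), $rent, $got_phones;
}

print $failures ? "$failures case(s) failed\n" : "all cases passed\n";
```

adding a line to @cases each time a new kind of entry turns up makes it easier to see whether a change to one pattern breaks an entry that used to work.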
 
------------------- test data begins ---------------------
 
2418 23rd, 3/1/cp Storage, $675/mo+Bills, $500/dep, 1500sf, Great Deal! 806-632-4037
LOWER PRICE Spacious 3/2/2 5742 36th 1700sf, Nice Neighborhood FP, Appls, $850/mo $400dep 828-1770
Remodeled Kitchen, 3/2/1, Rush Area 4812 12 St. 1450sqft, Lrg Yd/Patio $850/mo 763-7677 or 786-4862
4511 52nd, Lg. 3/2/2 Brick, C-H&A, Fresh Paint, 1600+ sf Great Loc. $850/mo Call 252-6928
4511 West 52nd, Lg. 3/2/2 Brick, C-H&A, Fresh Paint, 1600+ sf Great Loc. $850/mo Call 252-6928
2507 41st Cute 2/1/1 Big & spacious $650. First Mark 793-8759 / 789-0477
3709 24th Close to Hospitals & Tech 3-2 CH&A, Fridge, W/D Included $750 - First Mark 793-8759
FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. Covered Patio. 308 N. Clinton. 543-6016
Roomy 3/2/2, large closets, 2 living areas, skylights, new carpet, nice kitchen, CHA, 1920sf. 4314 49th $975 793-1712
3/1.5/1 House, 2303 84th, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831
3/1.5/1 House, 2303 David Givens, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831  ADDRESS FAILS!!!
3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 214-402-4414
3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 (214) 402-4414, 321-9876
3007 30th for Rent in Tech Terrace 3/2, $1200/mo Available 6/1/06 (214)402-4414
------------------- test data ends ------------------------
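
for the David Givens failure, one possible approach is to let a street name made of words run to two words instead of one.  this is only a sketch under that assumption: $st_name_2w, $street_address_2w and address_of are names i made up for illustration, the other patterns are copied from the code above.  the trade-off is that a descriptive word directly after a one-word street name (no comma or period in between) would be swallowed as part of the name.

```perl
use strict;
use warnings;

# building blocks copied from the code above
my $st_number  = qr( \d\d+ )x;
my $st_compass = qr( \s+ (?: n|no|north|ne|nw|s|so|south|se|sw|e|east|w|west ) \.? )ix;
my $st_title   = qr( \s+ (?: st | ave | blvd ) \.? )ix;
my $st_ordinal = qr( \s+ \d+ (?: st | nd | rd | th )? )ix;
my $st_alpha   = qr( \s+ [a-z]+ )ix;

# ASSUMPTION (not in the original code): a name made of words is one or
# two words long.  ordinals are still tried first, so ``41st Cute'' stops
# after the ordinal.
my $st_name_2w
    = qr( $st_compass? (?: $st_ordinal | (?: $st_alpha ){1,2} ) $st_title? )x;
my $street_address_2w
    = qr( (?: (?<= \s ) | ^ ) $st_number $st_name_2w )x;

# return the first street address found in an entry, or undef if none
sub address_of {
    my ($entry) = @_;
    my ($address) = $entry =~ m( ($street_address_2w) )x;
    return $address;
}

my @entries = (
    '3/1.5/1 House, 2303 David Givens, Convenient, Dishwasher, Large Bkyd, NS, $775/mo Call 698-8831',
    'FREE RENT - New 3/3/2, 1300sf, Appls. Very Nice. Covered Patio. 308 N. Clinton. 543-6016',
    '2507 41st Cute 2/1/1 Big & spacious $650. First Mark 793-8759 / 789-0477',
    'Remodeled Kitchen, 3/2/1, Rush Area 4812 12 St. 1450sqft, Lrg Yd/Patio $850/mo 763-7677 or 786-4862',
);

for my $entry (@entries) {
    my $address = address_of($entry);
    printf "address: %s\n", defined $address ? $address : '?';
}
```

whether two words is the right limit depends on the street names in your area, so it is worth running any such change over the whole listings file before trusting it.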
 