In a message dated 12/10/2005 7:04:59 P.M. Eastern Standard Time,
[EMAIL PROTECTED] writes:
> Hello:
> I'm trying to build a routine to split a string into fields by a
> specified delimiter. The string format is pretty close to CSV, except
> that quoted substrings can appear within an unquoted string, escaped
> quotes can exist within quoted strings, and the delimiter might exist
> within a quoted string (like in CSV). Specifically, it's to split
> recipient lists from e-mail "To:" headers. So for example:
>
> "LastName, FirstName" <address>, "Name" <address>, <address>; FirstName LastName, address; "First \"nick\" Last" address
>
> The above string should be split into:
>   "LastName, FirstName" <address>
>   "Name" <address>
>   <address>
>   FirstName LastName
>   address
>   "First \"nick\" Last" address
>
> All unquoted surrounding whitespace should be removed. I've gotten so
> far as this:
>
> # modified from the Perl Cookbook
> push(@list, $+)
>     while $text =~ /\s*("[^\"\\]*(?:\\.[^\"\\]*)*"(?:\s+[^;,]+))\s*[;,]?\s*|\s*([^;,]+)\s*[;,]?\s*|[;,]\s*/g;
> push(@list, undef) if (substr($text, -1, 1) =~ /[;,]/);
>
> But since the matches seem to be too greedy, it keeps trailing space
> before the delimiters. Can someone offer a better solution?
>
> NOTE: I want it to be as generic as possible, as I cannot expect the
> elements in the list to follow strict guidelines (there are too many
> broken programs out there and too many idiots!)
>
> Thanks!
> dZ.

hi dZ -
maybe try something like:
=================== begin code =====================
use warnings;
use strict;

my $text = q( "LastName, FirstName"
<address> , "Name" <address>, <add
ress>; FirstName LastName, address ; "First \"nick\" Last" ad dre
ss );

# The above string should be split into:
#   "LastName, FirstName" <address>
#   "Name" <address>
#   <add ress>
#   FirstName LastName
#   address
#   "First \"nick\" Last" ad dre ss
# All unquoted surrounding whitespace should be removed. I've gotten so
# far as this:
# modified from modification from the Perl Cookbook

my @list;

# a delimiter:
#   contains a single delimiter character;
#   may have any amount of whitespace before or after delimiter character.
my $delimiters = q(;,);                          # list of delimiter characters
my $delimiter  = qr/ \s* [$delimiters] \s* /x;   # delimiter sequence

# a double-quoted string:
#   includes opening and closing double-quotes;
#   may be empty;
#   may contain any characters, including delimiter characters;
#   may have backslash-escaped characters, including escaped double-quotes.
my $quoted_string = qr/ " [^"\\]* (?: \\. [^"\\]* )* " /x;

# an address:
#   must contain at least one character;
#   may not contain any delimiter character;
#   may have embedded whitespace, but may not begin or end with ws.
my $not_delimiter_or_ws = qr/[^$delimiters\s]/;
my $address = qr/ $not_delimiter_or_ws+ (?: \s+ $not_delimiter_or_ws+ )* /x;

push(@list, $+)
#   the original regex:
#   while $text =~ /\s*("[^\"\\]*(?:\\.[^\"\\]*)*"(?:\s+[^;,]+))\s*[;,]?\s*|\s*([^;,]+)\s*[;,]?\s*|[;,]\s*/g;
#   the same regex, expanded:
#   while $text =~ /
#           \s*                             # optional ws
#           ( " [^\"\\]* (?: \\. [^\"\\]* )* "  # quoted string...
#             (?: \s+ [^;,]+ )              # then address (unnecessary grouping)
#           )
#           \s*                             # optional ws
#           [;,]?                           # optional delimiter
#           \s*                             # optional ws
#       |                                   # or
#           \s*                             # optional ws
#           ( [^;,]+ )                      # address
#           \s*                             # optional ws
#           [;,]?                           # optional delimiter
#           \s*                             # optional ws
#       |                                   # or
#           [;,]                            # required delimiter
#           \s*                             # optional ws
#       /xg;
    while $text =~
        /
                                    # either
            \s*                     # optional ws
            (                       # capture...
                $quoted_string      #   quoted string
                \s+                 #   required ws
                $address            #   address
            )
            $delimiter?             # optional delimiter
        |                           # or
            \s*                     # optional ws
            (                       # capture...
                $address            #   address
            )
            $delimiter?             # optional delimiter
        |                           # or
            $delimiter              # required delimiter
        /xg;

push(@list, undef) if (substr($text, -1, 1) =~ /[;,]/);  # i'm not sure just what this is for

{ local $" = "]\n["; print "[@list]\n"; }
================= end code =========================
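
a possible variation, offered only as a sketch: if a quoted string may turn up anywhere in a field rather than only at the start, each field can be treated as a run of tokens, where a token is either a whole quoted string or a single character that is not a delimiter, whitespace, or a double quote. the sub name split_recipients and the $token and $field patterns are hypothetical names made up for this illustration; the quoted-string pattern is the one from the code above. note the assumptions: empty fields are dropped rather than returned as undef, and a stray unbalanced quote is skipped rather than reported.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the fields of
# $text, split on unquoted ';' or ',', with surrounding whitespace trimmed.
sub split_recipients {
    my ($text) = @_;

    my $delims = q(;,);    # delimiter characters

    # a double-quoted string, possibly containing backslash escapes
    my $quoted = qr/ " [^"\\]* (?: \\. [^"\\]* )* " /x;

    # one token: a whole quoted string, or one character that is not a
    # delimiter, whitespace, or a double quote
    my $token = qr/ $quoted | [^$delims\s"] /x;

    # a field: tokens with optional whitespace between them, so a field
    # can neither begin nor end with whitespace
    my $field = qr/ $token (?: \s* $token )* /x;

    # in list context, a /g match returns every captured field; delimiters
    # and surrounding whitespace are simply skipped over
    return $text =~ / ( $field ) /xg;
}

my $text = q("LastName, FirstName" <address>, "Name" <address>, <address>; FirstName LastName, address; "First \"nick\" Last" address);

print "[$_]\n" for split_recipients($text);
# prints:
# ["LastName, FirstName" <address>]
# ["Name" <address>]
# [<address>]
# [FirstName LastName]
# [address]
# ["First \"nick\" Last" address]
```

because a field has to start and end with a token, no separate trimming step is needed; the trade-off is that this version cannot tell "a,,b" apart from "a,b".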
hth -- bill walters
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
