----- Original Message -----
From: "Peter Eisengrein" <[EMAIL PROTECTED]>
To: "'Carl Jolley'" <[EMAIL PROTECTED]>
Cc: "Perl-Win32-Users Mailing List (E-mail)" <[EMAIL PROTECTED]>
Sent: Tuesday, March 19, 2002 4:37 PM
Subject: RE: multiline parse

> I shouldn't have been so vague, I suppose, but I think you are getting it,
> Carl. In this particular instance the pattern of 4,3,3 will repeat
> throughout the file. I come upon this multiline problem regularly but for
> different layouts -- there may be more or less than three line-groupings per
> record, and each line-grouping may have different number of columns. And the
> each key/value combination may be spread over 2 or more lines. Since I have
> tackled this several times now, with different patterns each time, I was
> wondering if there was a module that could be given a few key bits of data
> and then parse it appropriately.
>

Peter,

Are you meaning to say that none of the following are static, even within a
single data file?

1) number of lines per record
2) number of columns per record
3) number of lines per column value
4) names of columns in record

Given all of this, a single module with a single set of logic may not be
appropriate for all tasks. Something has to be static, or the logic has to be
pretty robust and make a lot of assumptions which may not always be true.

Hopefully, you can at least get #2 or #4 to be static. I've already given an
example of how to do this if #2 is static (and, actually, #1 is static also).
For #4, using the same input example, let's look at how we can do this with
the following assumption in place:

5) no columns may be repeated in a single record

I'm also going to get rid of that line, in your original example, which
separated the column name list from the column value. It's not really
necessary, but for how to deal with it, look at the while(<DATA>) loop in my
original example (and the much harried $cnt variable =).

This example will also assume that there is only one line per column value
(#3) -- you really should use an enclosure of some sort, such as double
quotes, if the value exceeds a single line.

e.g.:

FIELDx "my name is: sally"

otherwise, you'll end up biting the big one if someone uses your record
separator in the column value. And if you do this, you'll either need to
expand your logic in the while(<DATA>) loop, or you'll have to slurp the whole
darn thing into a single string and use the /s modifier on a regexp. I'm going
to ignore that requirement at the moment, and just look at column grouping =)

Assuming that these are the possible column name values: F1,F2,F3,F4,F5 :

--code--

# possible delimiters
my $delimiters = '\s|\t';

my(@columns,@values);
my $cnt = 0;

# the value of $cnt is initially 0, since the format of the data is:
#   COLUMN NAMES
#   VALUES LIST
# use two possible values, 0 for COLUMN NAMES line
#                          1 for VALUES LIST lines.

while(<DATA>) {

    # if the value of $cnt is equivalent to zero, push into @columns
    push(@columns,split(/(?i:$delimiters)+/,$_)) if($cnt == 0);

    # if the value of $cnt is equivalent to 1, push into @values
    # this means we're on the second line in a 'line group'
    push(@values,split(/(?i:$delimiters)+/,$_)) if($cnt == 1);

    # if the value of $cnt is less than 1,
    # auto-increment...
    # otherwise, if it happens to be 1, set it to 0. (start
    # looking for columns, again)
    $cnt = $cnt < 1 ? $cnt + 1 : 0;

} #endwhile

# initialize a hash
my %data = ();

# now, we want to go find groups of columns, we assume that column
# names cannot be repeated in a single record.

# init two hashes (one temp [lastfind], one perm [groups])
# and a counter to indicate which record (or group of columns)
# we're currently in. starting at zero (you can start at any number)
my %lastfind = ();
my %groups   = ();
my $cur_grp  = 0;

# go through @columns, entry by entry
foreach my $ent (0..$#columns) {

    # get actual column name from @columns for this position
    my $c_name = $columns[$ent];

    # this is just to be verbose, perl actually will automatically create
    # the hash key for us in the push() inside of the else { block below
    # on the first run, but what the heck, eh?
    $groups{"$cur_grp"} = [] if(!exists($groups{"$cur_grp"}));

    # if we already have an entry for the column name in %lastfind, then
    # we're repeating a column name, and must be starting a new record/group
    # of column names.
    if(exists($lastfind{"$c_name"})) {

        # increment current group/record # and create a new key in our
        # organization hash. $groups{"$cur_grp"} points to an array
        $cur_grp++;
        $groups{"$cur_grp"} = [];

        # push an anonymous array into the $groups{"$cur_grp"} array,
        # containing two elements, the column name, and the position it
        # occurred in @columns
        push(@{ $groups{"$cur_grp"} },[ $c_name,$ent ]);

        # reset %lastfind (so previously matched columns will not already
        # be matched on future runs), and create a key for the column we
        # just found (that was repeated from the previous group)
        %lastfind = ();
        $lastfind{"$c_name"} = 1;
        next;

    } else {

        # otherwise, we don't already have this column name listed...
        $lastfind{"$c_name"} = 1;

        # add column name and position in @columns to our
        # $groups{"$cur_grp"} key (which, as stated above, points to an array)
        push(@{ $groups{"$cur_grp"} },[ $c_name,$ent ]);
        next;
    }
}

# this assumes that if a column is listed, so is a value
# (that is, if $#columns == 30, then $#values == 30)

# for each group of columns / record, in numeric order
# (keys() alone returns the groups in no particular order):
foreach my $key (sort { $a <=> $b } keys(%groups)) {

    print("Record (col group) #: $key\n");

    # $groups{"$key"} is a pointer to an array, where each element
    # is a two-element anonymous array
    foreach my $aRef (@{ $groups{"$key"} }) {

        my $col_name = $aRef->[0];
        my $val_pos  = $aRef->[1];

        # as stated above, this assumes that $#columns == $#values
        my $actual_value = $values[$val_pos];

        print("\t$col_name = $actual_value\n");
    }
}

__DATA__
F1 F2 F3 F4 F5
dataf1 dtaf2 dttaf3 dttttt4 dtt5f
F3 F5 F4 F1
dtf3 dtf5 dtf4 dtf1
F1
dtatf1
F3 F4 F5 F2
df3 df4 df5 df2
F2 F3 F4
dtf2 dtf3 dtf4

--endscript--

Which produces the following output:

C:\temp>perl -w multi2.pl
Record (col group) #: 0
        F1 = dataf1
        F2 = dtaf2
        F3 = dttaf3
        F4 = dttttt4
        F5 = dtt5f
Record (col group) #: 1
        F3 = dtf3
        F5 = dtf5
        F4 = dtf4
        F1 = dtf1
Record (col group) #: 2
        F1 = dtatf1
        F3 = df3
        F4 = df4
        F5 = df5
        F2 = df2
Record (col group) #: 3
        F2 = dtf2
        F3 = dtf3
        F4 = dtf4

C:\temp>

Note how group #2 got to include F3,F4,F5, and F2 -- this can be a bug, or a
feature, if you so please. The reason is that the next line of the data file
(which had F3,F4,F5,F2) did not have any column names which were repeated in
the previous line (which had only F1).

Hope this helps add another piece to your puzzle.

!c (dolljunkie)
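
On the quoted-value point set aside earlier (slurp the input, then match with
/s): a minimal sketch of what that might look like, not a tested drop-in for
the script above. The helper name tokenize() and the sample input are made up
for illustration, and the sketch assumes values never contain an escaped
double quote.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): split slurped text
# into tokens, treating a double-quoted run as one value even when it
# spans several lines. Escaped quotes inside a value are not handled.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    # \G continues from the end of the previous match; /s lets "." cross
    # newlines, so a quoted value may contain line breaks.
    while ($text =~ /\G\s*(?:"(.*?)"|(\S+))/gs) {
        push(@tokens, defined($1) ? $1 : $2);
    }
    return @tokens;
}

# Made-up sample input: one quoted value spread over two lines.
my $sample = qq{FIELDx "my name is:\nsally" FIELDy plain\n};
my @tokens = tokenize($sample);
print(scalar(@tokens), " tokens\n");    # prints "4 tokens"
```

To feed it a whole file, the text would first have to be read in one go, for
example with "local $/; my $text = <DATA>;". The resulting token list would
then still need to be divided into column names and values by whatever rule
fits the layout.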