Re: multiline parse

c. church Wed, 20 Mar 2002 12:41:33 -0800


----- Original Message -----
From: "Peter Eisengrein" <[EMAIL PROTECTED]>
To: "'Carl Jolley'" <[EMAIL PROTECTED]>
Cc: "Perl-Win32-Users Mailing List (E-mail)"
<[EMAIL PROTECTED]>
Sent: Tuesday, March 19, 2002 4:37 PM
Subject: RE: multiline parse



> I shouldn't have been so vague, I suppose, but I think you are getting it,
> Carl. In this particular instance the pattern of 4,3,3 will repeat
> throughout the file. I come upon this multiline problem regularly but for
> different layouts -- there may be more or less than three line-groupings
> per record, and each line-grouping may have different number of columns.
> And the each key/value combination may be spread over 2 or more lines.
> Since I have tackled this several times now, with different patterns each
> time, I was wondering if there was a module that could be given a few key
> bits of data and then parse it appropriately.
>


Peter,

Do you mean that none of the following is static, even within a single data
file?

1) number of lines per record
2) number of columns per record
3) number of lines per column value
4) names of columns in record

Given all of this, a single module with a single set of logic may not be
appropriate for all tasks.

Something has to be static, or the logic has to be pretty robust and make
a lot of assumptions which may not always be true.

Hopefully, you can at least get #2 or #4 to be static. I've already given an
example of how to do this if #2 is static (and, actually, #1 is static
also).  For #4, using the same input example, let's look at how we can do
this with the following assumption in place:

  5)  no columns may be repeated in a single record

I'm also going to get rid of the line in your original example which
separated the column name list from the column values.  It's not really
necessary, but for how to deal with it, look at the while(<DATA>) loop in my
original example (and the much-harried $cnt variable =).


This example will also assume that there is only one line per column value
(#3) -- you really should use an enclosure of some sort, such as double
quotes, if the value exceeds a single line. e.g.:

FIELDx
"my name is:
    sally"

Otherwise, you'll end up biting the big one if someone uses your record
separator in the column value.

AND, if you do this, you'll either need to expand your logic in the
while(<DATA>) loop, or you'll have to slurp the whole darn thing into a
single string, and use the /s modifier on a regexp.
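
For what it's worth, here is a rough sketch of the slurp-and-/s route. It is
not part of the script below, and it assumes a layout I made up for the
illustration: a column name alone on a line, its value on the next line, and
no double quotes embedded inside a quoted value. The names $text and @pairs
are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: a column name on one line, its value on the next.
# A value may be wrapped in double quotes and then span several lines.
# Assumes a quoted value never contains a double quote itself.
my $text = <<'END';
FIELDx
"my name is:
    sally"
FIELDy
plain value
END

my @pairs;

# /m lets ^ match at the start of every line, /s lets . match newlines
# inside the quoted form, and /g walks through the whole string.
while ($text =~ /^(\w+)\n(?:"(.*?)"|([^\n]*))\n/msg) {
    my ($name, $value) = ($1, defined $2 ? $2 : $3);
    push @pairs, [ $name, $value ];
}

# Brackets make the embedded newline in the first value visible.
print "$_->[0] = [$_->[1]]\n" for @pairs;
```

With a real file you would fill $text by slurping, e.g.
my $text = do { local $/; <DATA> };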

I'm going to ignore that requirement at the moment, and just look at column
grouping =)


Assuming that these are the possible column name values: F1,F2,F3,F4,F5 :

--code--

    # possible delimiters
my $delimiters = '\s|\t';

my(@columns,@values);

my $cnt = 0;

    # the value of $cnt is initially 0, since the format of the data is:
    # COLUMN NAMES
    # VALUES LIST
    # use two possible values, 0 for COLUMN NAMES line
    # 1 for VALUES LIST lines.

while(<DATA>) {

        # if the value of $cnt is equal to zero, push into @columns

    push(@columns,split(/(?i:$delimiters)+/,$_)) if($cnt == 0);

        # if the value of $cnt is equivalent to 1, push into @values
        # this means we're on the second line in a 'line group'

    push(@values,split(/(?i:$delimiters)+/,$_)) if($cnt == 1);

        # if the value of $cnt is less than 1,
        # auto-increment...
        # otherwise, if it happens to be 1, set it to 0 (start
        # looking for columns again)

     $cnt = $cnt < 1 ? $cnt + 1 : 0;

    } #endwhile


    # initialize a hash

my %data = ();

    # now, we want to go find groups of columns, we assume that column
    # names cannot be repeated in a single record.

    # init two hashes (one temp [lastfind], one perm[groups])
    # and a counter to indicate which record (or group of columns)
    # we're currently in.  starting at zero (you can start at any number)

my %lastfind = ();
my %groups = ();
my $cur_grp = 0;

    # go through @columns, entry by entry

foreach my $ent (0..$#columns) {

        # get actual column name from @columns for this position

    my $c_name = $columns[$ent];

        # this is just to be verbose; perl will actually create the hash
        # key for us automatically in the push() inside the else { block
        # below on the first run, but what the heck, eh?

    $groups{"$cur_grp"} = [] if(!exists($groups{"$cur_grp"}));

        # if we already have an entry for the column name in %lastfind, then
        # we're repeating a column name, and must be starting a new
        # record/group of column names.

    if(exists($lastfind{"$c_name"})) {

            # increment current group/record # and create a new key in our
            # organization hash.  $groups{"$cur_grp"} points to an array

        $cur_grp++;
        $groups{"$cur_grp"} = [];

            # push an anonymous array into the $groups{"$cur_grp"} array,
            # containing two elements: the column name, and the position
            # it occurred at in @columns

        push(@{ $groups{"$cur_grp"} },[ $c_name,$ent ]);

            # reset %lastfind (so previously matched columns will not
            # already be matched on future runs), and create a key for the
            # column we just found (the one repeated from the previous group)

        %lastfind = ();
        $lastfind{"$c_name"} = 1;
        next;

        } else {

                # otherwise, we don't already have this column name
                # listed...

            $lastfind{"$c_name"} = 1;

                # add column name and position in @columns to our
                # $groups{"$cur_grp"} key (which, as stated above, points
                # to an array)

            push(@{ $groups{"$cur_grp"} },[ $c_name,$ent ]);
            next;
            }
    }

    # this assumes that if a column is listed, so is a value
    # (that is, if $#columns == 30, then $#values == 30)

    # for each group of columns / record, in numeric order
    # (keys() alone returns hash keys in no particular order):

foreach my $key (sort { $a <=> $b } keys(%groups)) {
    print("Record (col group) #: $key\n");

        # $groups{"$key"} is a pointer to an array, where each element
        # is a two-element anonymous array

    foreach my $aRef (@{ $groups{"$key"} }) {
        my $col_name = $aRef->[0];
        my $val_pos = $aRef->[1];

            # as stated above, this assumes that $#columns == $#values

        my $actual_value = $values[$val_pos];

        print("\t$col_name = $actual_value\n");
        }
    }

__DATA__
F1    F2    F3    F4    F5
dataf1    dtaf2    dttaf3    dttttt4    dtt5f
F3    F5    F4    F1
dtf3    dtf5    dtf4    dtf1
F1
dtatf1
F3    F4    F5    F2
df3    df4    df5    df2
F2    F3    F4
dtf2    dtf3    dtf4
--endscript--

Which produces the following output:
C:\temp>perl -w multi2.pl
Record (col group) #: 0
        F1 = dataf1
        F2 = dtaf2
        F3 = dttaf3
        F4 = dttttt4
        F5 = dtt5f
Record (col group) #: 1
        F3 = dtf3
        F5 = dtf5
        F4 = dtf4
        F1 = dtf1
Record (col group) #: 2
        F1 = dtatf1
        F3 = df3
        F4 = df4
        F5 = df5
        F2 = df2
Record (col group) #: 3
        F2 = dtf2
        F3 = dtf3
        F4 = dtf4
C:\temp>

Note how group #2 got to include F3, F4, F5, and F2 -- this can be a bug or a
feature, as you please.  The reason is that the next line of column names in
the data file (F3,F4,F5,F2) did not repeat any column name from the previous
line (which had only F1), so no new group was started.
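
To see that merge in isolation, here is a condensed sketch of the same
repeat-detection idea. It is an illustration only, not a replacement for the
script above: group_records is a hypothetical helper, and the sample names and
values are made up. It also checks the columns/values count assumption up
front instead of trusting it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the script above): split a flat list of
# column names into records, starting a new record whenever a name repeats,
# and pair each name with the value at the same position.
sub group_records {
    my ($columns, $values) = @_;

    # The pairing is positional, so both lists must be the same length.
    die "columns/values count mismatch\n" unless @$columns == @$values;

    my @records = ([]);
    my %seen;
    for my $i (0 .. $#$columns) {
        my $name = $columns->[$i];
        if ($seen{$name}) {          # repeated name: begin a new record
            push @records, [];
            %seen = ();
        }
        $seen{$name} = 1;
        push @{ $records[-1] }, [ $name, $values->[$i] ];
    }
    return @records;
}

# Made-up sample: think of three column-name lines, "F1 F2", "F1", "F3 F2".
my @columns = qw(F1 F2 F1 F3 F2);
my @values  = qw(a  b  c  d  e);

my @records = group_records(\@columns, \@values);
for my $n (0 .. $#records) {
    print "Record $n: ",
        join(', ', map { "$_->[0]=$_->[1]" } @{ $records[$n] }), "\n";
}
```

The third name line (F3 F2) lands in the record started by the lone F1,
because neither name had been seen since the last reset. If you would rather
treat every column-name line as its own record, you would need some other
signal than a repeated name.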


Hope this helps add another piece to your puzzle.

!c (dolljunkie)

