Re: [MacPerl] regular expressions and formatting text

Bart Lateur Thu, 14 Jun 2001 05:04:30 -0700
On 13 Jun 2001 20:37:56 EDT, Katherine Richmond wrote:

>I have written a perl program that parses a file and organizes the data into
>records with 3 fields. Then this data is imported into FileMaker. Sample data is
>below. The S numbers are unique identifiers. Every line with the same S number
>belongs to one record. The 3070050 and 3070055 numbers are codes. Each line with
>the 3070050 code belongs to the diagnosis field, and each line with the 3070055
>code belongs to the comment field. 

3? I can only see 2.

>My problem is the formatting for the 2 multi-line fields. These fields end up
>with a lot of extraneous white space in them.

>So I used this regular expression to replace 2 or more consecutive spaces with
>one space:
>
>$textline =~ s/\s{2,}/ /g;

Eh, wrong. \s matches any whitespace, not just spaces (and tabs). It will
strip newlines too. You don't really want to get rid of those.

        s/[ \t]+/ /g;

or, using tr with the /s modifier (I think of it as "single" character
remaining)

        tr/\t / /s;
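
To see the difference, here is a minimal sketch that runs all three
variants on the same string. The sample text and the sub names are made
up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical sample: two lines with runs of spaces and a tab pair.
my $sample = "first   line\t\twith gaps  \n  second line\n";

# The original attempt: any run of 2+ whitespace characters becomes
# one space. "  \n  " is such a run, so the line break is lost.
sub squeeze_all_whitespace {
    my($text) = @_;
    $text =~ s/\s{2,}/ /g;
    return $text;
}

# Only spaces and tabs are collapsed; newlines are left alone.
sub squeeze_blanks {
    my($text) = @_;
    $text =~ s/[ \t]+/ /g;
    return $text;
}

# Same idea with tr: map tab and space to space, squeeze repeats.
sub squeeze_blanks_tr {
    my($text) = @_;
    $text =~ tr/\t / /s;
    return $text;
}

print squeeze_all_whitespace($sample), "---\n";
print squeeze_blanks($sample), "---\n";
print squeeze_blanks_tr($sample), "---\n";
```

The first variant prints everything on one line; the other two keep the
line break and only shrink the runs of blanks.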


My interpretation of the data is as follows:

* lines start with two zeros, a space, an S with 4 digits, 3 spaces, 7
digits, a space. As you said: the S number is a record identifier, the 7
digit number is for the diagnosis (3070050) or the comment (3070055)
field, and one more....

* "^" at the end of a line represents a hard return. No "^" is a soft
return, i.e. it may be joined with the next line.

* "CR-0" is an empty line marker, when on a line on its own (apart from
spaces and the "^" end-of-line marker).
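
As a rough sketch of how these three rules could be applied to a single
line: parse_line is a hypothetical helper, and the sample lines are
invented to match the layout described above, not taken from the real
data.

```perl
use strict;
use warnings;

# Split one input line into its parts, following the three rules above.
# Returns an empty list if the line does not look like a data line.
sub parse_line {
    my($line) = @_;
    my($id, $code, $text) = $line =~ /^00 S(\d{4}) {3}(\d{7}) (.*)/
        or return;
    # A trailing "^" marks a hard return; without it the line is a
    # soft return and may be joined with the next line.
    my $hard = ($text =~ s/\^\s*$//) ? 1 : 0;
    # "CR-0" alone on the line stands for an empty line.
    $text = '' if $text =~ /^CR-0 *$/;
    return ($id, $code, $text, $hard);
}

# Hypothetical sample lines, made up to match the described layout.
for my $line ("00 S1234   3070050 first part of a sentence ",
              "00 S1234   3070050 that ends here.^",
              "00 S1234   3070055 CR-0   ^") {
    my($id, $code, $text, $hard) = parse_line($line);
    print "id=$id code=$code hard=$hard text=[$text]\n";
}
```

This is stricter than it needs to be (exactly two zeros, exactly three
spaces); a looser pattern is more forgiving if the real data varies.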

So here's my code, with the data you posted in the __DATA__ section:

#! perl -w
my %field = ( '3070050' => 'Diagnosis', '3070055' => 'Comment');
my(%record, $id);
while(<DATA>) {
    my($record_id, $fieldtype, $text) = /^\d+ S(\d+) +(\d{7}) (.*)/
        or next;
    if(defined $id && $id != $record_id) {
        flush($id, \%record);  # record complete
        %record = ();
    }
    $id = $record_id;
    for($text) {
        s/\^\s*$/\n/ or $_ .= " ";
        s/^CR-0 *$//;
    }
    $record{$field{$fieldtype}} .= $text;
}
flush($id, \%record) if defined $id;    # last record

sub flush {
    my($id, $record) = @_;
    foreach(@{$record}{keys %$record}) {  # values, aliased via hash slice
        tr/ \t/ /s;
        s/\s+$//;       # remove trailing newlines and spaces
        s/^ //;         # remove leading space on first line, or:
        # s/^ //gm;     # remove leading space on each line
    }
    use Data::Dumper;
    print "Record: $id\n", Dumper $record;
}

__DATA__
... data follows

-- 
        Bart.