Hi everyone,
I'm learning to use regular expressions. Would you please give me advice about
how to proceed?
Thanks,
Kathy
I have written a Perl program that parses a file and organizes the data into
records with 3 fields. Then this data is imported into FileMaker. Sample data is
below. The S numbers are unique identifiers. Every line with the same S number
belongs to one record. The 3070050 and 3070055 numbers are codes. Each line with
the 3070050 code belongs to the diagnosis field, and each line with the 3070055
code belongs to the comment field.
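To make the record layout described above concrete, here is a minimal sketch of the grouping step. It is not the actual program: the helper name parse_records and the hash keys id, diagnosis, and comment are made up for illustration, and it assumes every line has the shape "prefix, S number, 7-digit code, text" separated by whitespace, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: group input lines into records keyed by S number.
# Assumes each line looks like "<prefix> <S number> <code> <text>".
sub parse_records {
    my @lines = @_;
    my (%records, @order);
    my %field_for = ('3070050' => 'diagnosis', '3070055' => 'comment');

    for my $line (@lines) {
        # Capture the S number, the 7-digit code, and the rest of the line.
        next unless $line =~ /^\s*\d+\s+(S\d+)\s+(\d{7})\s+(.*?)\s*$/;
        my ($id, $code, $text) = ($1, $2, $3);
        my $field = $field_for{$code} or next;    # skip unknown codes
        push @order, $id unless exists $records{$id};
        $records{$id}{id} = $id;
        push @{ $records{$id}{$field} }, $text;   # keep each line separate
    }
    # Return the records in the order their S numbers first appeared.
    return map { $records{$_} } @order;
}

my @records = parse_records(
    '00 S0199 3070050 2. No evidence of malignancy.^',
    '00 S0199 3070050 CR-0^',
    '00 S0199 3070055 are identified. ^',
);
print "$records[0]{id}: ", scalar @{ $records[0]{diagnosis} }, " diagnosis line(s)\n";
```

Keeping each field as a list of lines, rather than one concatenated string, leaves the choice of line separator open until the output is written.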
My problem is the formatting for the 2 multi-line fields. These fields end up
with a lot of extraneous white space in them. Sample data from the original file
that goes into the diagnosis field of one record:
Right peritoneal "tissue" (cell block section):^
1. Mesothelial cells, wbcs, and blood clot.^
2. No evidence of malignancy.^
CR-0^
looks like this in the out file (all on one line):
Right peritoneal "tissue" (cell block section):^
1. Mesothelial cells, wbcs, and blood clot.^
2. No evidence of malignancy.^ CR-0^
So I used this regular expression to replace 2 or more consecutive whitespace
characters with one space:
$textline =~ s/\s{2,}/ /g;
Now the data looks like this (all on one line):
Right peritoneal "tissue" (cell block section):^ 1. Mesothelial cells, wbcs,
and blood clot.^ 2. No evidence of malignancy.^ CR-0^
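One detail about the substitution above: \s matches newlines and carriage returns as well as spaces and tabs, so s/\s{2,}/ / also swallows any line break that sits next to padding. The sketch below squeezes only spaces and tabs and leaves line breaks in place. The helper name squeeze_spaces is hypothetical, and it assumes the field text uses "\n" between lines.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze runs of spaces/tabs but keep line breaks.
# Assumes lines inside $text are separated by "\n".
sub squeeze_spaces {
    my ($text) = @_;
    $text =~ s/[ \t]{2,}/ /g;   # runs of spaces/tabs become one space
    $text =~ s/^[ \t]+//mg;     # remove padding at the start of each line
    $text =~ s/[ \t]+$//mg;     # remove padding at the end of each line
    return $text;
}

my $field = "Right peritoneal \"tissue\":^      \n      1. Mesothelial  cells.^\n";
print squeeze_spaces($field);
```

The /m modifier makes ^ and $ match at every line boundary inside the string, which is what lets the last two substitutions trim each line separately.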
I am still left with ^ and CR-0. I can remove those. But is there any way to
preserve the formatting of the text? Much of the data is in multi-line, outline
format, and it looks awful without the carriage returns. I tried adding a \r at
the end of each line from the input file. This is what the data looks like
after that change:
Right peritoneal "tissue" (cell block section):^
1. Mesothelial cells, wbcs, and blood clot.^
2. No evidence of malignancy.^
CR-0^
But this is what it looks like after I import it into FileMaker:
Right peritoneal "tissue" (cell block section):^
1. Mesothelial cells, wbcs, and blood clot.^
2. No evidence of malignancy.^
CR-0^
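The cleanup mentioned above (dropping the ^ markers and the CR-0 lines, then joining the remaining lines with a separator) could be sketched as follows. The helper name join_field_lines is hypothetical. The separator is a parameter because which character survives the import depends on the file format FileMaker reads; "\r" here only mirrors the attempt described above and is an assumption, not a confirmed fix.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy the lines of one multi-line field and join them.
# $sep is the in-field line separator; "\r" is an assumption, not a known fix.
sub join_field_lines {
    my ($sep, @lines) = @_;
    my @clean;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;    # trim whitespace at both ends
        $line =~ s/\s*\^$//;        # drop the trailing ^ marker
        next if $line eq '' or $line eq 'CR-0';   # skip empty and CR-0 lines
        push @clean, $line;
    }
    return join $sep, @clean;
}

my $diagnosis = join_field_lines(
    "\r",
    'Right peritoneal "tissue" (cell block section):^',
    '1. Mesothelial cells, wbcs, and blood clot.^',
    '2. No evidence of malignancy.^',
    'CR-0^',
);
print length($diagnosis), " characters in the cleaned field\n";
```

Because the separator is passed in, a different character can be tried without touching the cleanup logic.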
Here's some of the original data:
00 S2495 3070050 Tenosynovium, left hand; excisional bioppsy:^
00 S2495 3070050 Connective tissue with mild myxomatous change. No
00 S2495 3070050 inflammation^
00 S2495 3070050 CR-0 ^
00 S2495 3070055 CR-0 ^
00 S0162 3070050 Mediastinal mass:^
00 S0162 3070050 Hodgkin's lymphoma, nodular sclerosing type- see
00 S0162 3070055 Sections stained by the immunoperoxidase technique with
00 S0162 3070055 CD15, CD30, CD45, CD20, CD79a and CD3 revealed that large
00 S0162 3070055 cytologic characteristics of Reed-Sternberg cells and
00 S0162 3070055 CD15, CD30 positive, CD45, CD20, CD79a, CD3 negative
00 S0162 3070055 with Hodgkin's lymphoma. Flow cytometric analysis
00 S0162 3070055 or T cell clonality, consistent with that diagnosis. ^
00 S0199 3070050 Right peritoneal "tissue" (cell block section):^
00 S0199 3070050 1. Mesothelial cells, wbcs, and blood clot.^
00 S0199 3070050 2. No evidence of malignancy.^
00 S0199 3070050 CR-0^
00 S0199 3070055 The cells in the clot section are strongly
00 S0199 3070055 and cytokeratins AE1/3 and are negative for PSA. Some of
00 S0199 3070055 CK5/6. The findings are consistent with mesothelial
00 S0199 3070055 are identified. ^
00 S0256 3070050 Liver; needle core biopsy:^
00 S0256 3070050 Liver with marked portal and periportal fibrosis. No
00 S0256 3070050 lymphoproliferative disease (see Comment).^
00 S0256 3070050 CR-0 ^
00 S0256 3070055 Trichrome and reticulin stain confirm the degree of
00 S0256 3070055 shows 1+ positivity in hepatocytes and portal
00 S0256 3070055 Dr Smith has reviewed the material and aggrees.