A technical question about Perl and the MARC 520 field.

Background: My school has an ETD (electronic theses) project, and
students fill in an online form before they publish their theses
online. The form requires an "Abstract", which is used as the source
for the 520 field. We are now writing a Perl program to turn the data
from the form into a MARC record automatically.

Problem: At first, the 520 field of the MARC records looked odd in the
school library system whenever students started new paragraphs while
filling in the online form. There were many extra spaces and empty
lines between the two paragraphs.

Later, our technician wrote the following program to correct it:

use HTML::Parse;       # provides parse_html()
use HTML::FormatText;

# Capture the last word of the string.
$abstract =~ m/([a-zA-Z0-9\.]+)[.!?]?\s*$/x;
$lastword = $1;

# Remove HTML from abstract
# (for some reason this can eat the last word of the string)
$abstract = HTML::FormatText->new->format(parse_html($abstract));
chomp($abstract);

# Remove newlines/tabs/extra spaces from abstract
$abstract =~ s/\s+/ /g;
$abstract =~ m/([a-zA-Z0-9\.]+)[.!?]?\s*$/x;
if ($1 ne $lastword) {
    $abstract .= " $lastword";
}
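
For comparison, the whitespace cleanup on its own can be sketched with
core Perl only. This is a minimal sketch, not the technician's program:
the helper name collapse_whitespace is hypothetical, and it assumes the
abstract has already been reduced to plain text (no HTML handling is
attempted here).

```perl
use strict;
use warnings;

# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into one space and trim both ends. Assumes plain-text input.
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # paragraph breaks and indentation become one space
    $text =~ s/^ //;       # drop a leading space left by the step above
    $text =~ s/ $//;       # drop a trailing space likewise
    return $text;
}

my $abstract = "First paragraph.\n\n\n   Second paragraph,\tspecially fine.  \n";
print collapse_whitespace($abstract), "\n";
```

Because the helper only touches whitespace, it cannot drop words; any
text that goes missing would have to be lost in an earlier step.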

The program now handles a short 520 paragraph nicely. However, if the
abstract is very long, some words are eaten at the end of some lines,
for example:
"... spe
woody..."

The correct text would read "specially fine with woody ... "

Does anyone know how to handle this problem?

Thanks a lot for the help!

Best,

Clara
 

-----Original Message-----
From: Bryan Baldus [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 09, 2004 12:04 PM
To: [EMAIL PROTECTED]
Subject: MARC error checking with Perl updates and question

I have once again updated my error checking modules [1],
(MARC::)Errorchecks.pm and (MARC::)Lintadditions.pm. I am running out of
new things to check for, though I do have a few ideas in mind, including
attempting to find miscoded geographical headings and topical headings
(e.g. if "United States" appears in a 6xx subfield other than 651$a or
6xx$z, it may be miscoded (though not always), or if "Dogs" is in 651$a
or 6xx$z, it is probably miscoded), as well as the items in the "Current
planned in progress tasks" on my site.

I have added a question concerning grep at the end of this message.
Thank you for any assistance you may be able to provide.

Changes:

(Aug. 8, 2004):

Module updates:

Errorchecks.pm:
Version 1.01: Updated July 20-Aug. 7, 2004. Released Aug. 8, 2004.

-Temporary (or not) workaround for check_bk008_vs_bibrefandindex($record)
and bibliographies.
-Removed variables from some error messages and cleanup of messages.
-Code readability cleanup.
-Added subroutines
--check_240ind1vs1xx($record) -- Reports errors based on whether 240 and
1xx are both present and first indicator is 1 or 0.
--check_041vs008lang($record) -- Compares first code in subfield 'a' of
041 vs. 008 bytes 35-37.
--check_5xxendingpunctuation($record) -- Looks for final punctuation in
several of the 5xx fields.
--findfloatinghypens($record) -- Looks for space-hyphen-space in each
field (in a list of given fields).
--video007vs300vs538($record) -- In video records, compares 007 values
vs. 300 and 538 fields. Limited to VHS, DVD, and Video CD.
--ldrvalidate($record) -- Checks for valid bytes in the user-changeable
leader bytes.
--geogsubjvs043($record) -- Reports missing 043 if 651 or 6xx$z is
present. Has a list of exceptions (e.g. English-speaking countries).
--findemptysubfields($record) -- Looks for empty subfields (e.g.
$x$xPsychology.)
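
To illustrate the kind of comparison check_041vs008lang describes, here
is a minimal sketch on plain strings. It is not the code in
Errorchecks.pm: the function name lang_mismatch_041_vs_008 and the
message wording are hypothetical, and it assumes the caller has already
extracted the 008 field data and the 041 subfield 'a' as strings.

```perl
use strict;
use warnings;

# Hypothetical helper, not taken from MARC::Errorchecks.
# $field008: full 40-character 008 data; $sub041a: contents of 041 $a.
# Returns a warning string on mismatch, or undef (in scalar context)
# when the codes agree or the input is too short to compare.
sub lang_mismatch_041_vs_008 {
    my ($field008, $sub041a) = @_;
    return unless length($field008) >= 38 && length($sub041a) >= 3;
    my $lang008 = substr($field008, 35, 3);   # 008 bytes 35-37
    my $lang041 = substr($sub041a, 0, 3);     # first three-letter code in 041 $a
    return if $lang008 eq $lang041;
    return "041: first code ($lang041) does not match 008 bytes 35-37 ($lang008).";
}

my $field008 = (' ' x 35) . 'eng' . ' d';     # placeholder 008 with 'eng' at 35-37
my $warning  = lang_mismatch_041_vs_008($field008, 'freeng');
print "$warning\n" if defined $warning;
```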

Changed subroutines:
-check_bk008_vs_300:
--added cross-checking for codes a, b, c, g (ill., map(s), port(s).,
music)
--added checking for 'p. ' or 'v. ' or 'leaves ' in subfield 'a'
--added checking for 'cm.', 'mm.', 'in.' in subfield 'c'
--revised check for 'm', phono. (which QBI doesn't currently use)
--Added check in check_bk008_vs_bibrefandindex($record) for 'Includes
index.' (or indexes) in 504
---This has a workaround I would like to figure out how to fix

Lintadditions.pm:

Version 1.03: Updated July 20-Aug. 7, 2004. Released Aug. 8, 2004.

-Added check_1xx and check_7xx sets.
-Added checks for non-filing indicator in 130, 630, 730, 740 and 830.
-Added indicator check for 700--ind1 == 3 -> error.
-Added validation of 041 against MARC Code List for Languages.
-Added check_028 and check_037.
-Removed some variables from warning messages.
-Added check_050.
-Added check_040 (IOrQBI specific).
-Added check_440 and check_490.
-Added check_246.
-Changed check_245 ending punctuation errors based on MARC21 rule change
vs. LCRI 1.0C from Nov. 2003.
-Added check for square brackets in 245 $h.
-Added check for 260 ending punctuation.

Added and changed scripts:

Most of these are test scripts created while writing the subroutines
listed above.

The subroutines in the modules may have code not in the scripts, so it
is best to use the module rather than the script for those checks (the
last 3 full record scripts).

Full record:

-fieldsubfieldcounts.txt -- Field and subfield count--will report totals
for each tag and subfield.
--First version: Field tag counts only.
-testnewerrorchecks.txt -- Test script to call new subroutines in
Errorchecks.pm (MARC::Errorchecks).
-ldrvalidatescript.txt -- In Errorchecks.pm
-viddvdvsvhs.txt -- In Errorchecks.pm.
-findemptysubfields.txt -- Looks for empty subfields. Skips 037 in
CIP-level records. In Errorchecks.pm.

Cleanup:

-find050doubleperiod.txt -- Test regex for finding pattern in 050$a.
Preliminary code for MARC::Lintadditions::check_050()
-removetitlefromlintrpt.txt -- Removes titles from lintallchecks' output
file.
-findmissing300apunctuation.txt -- Looks for missing period after p or v
in 300a extract file. Initial step for
MARC::Errorchecks::check_bk008_vs_300($record) code.

------------------------------------- 
Question:

In the following code, is there a more efficient way to write the grep
for "Includes index(es)." to get the same result?

### workaround ###
        my @indexin504 = grep {
                $_ =~ /Includes(.*)index(es)?(\.)*/;
                push @indexalone, $1.$3;
                $_ =~ /Includes(.*)index(\.)*/;
        } @fields504;
        #look for 'Includes index.' in 504
        foreach my $indexalonein504 (@indexalone) {
                #report error if have only space between 'Includes'
                #and 'index' (followed by period)
                if ($indexalonein504 =~ /^ \.$/) {
                        push @warningstoreturn,
                            ("504: 'Includes index' should be 500.")
                } #if index is alone in 504
        } #foreach index alone
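
One possible simplification, offered as a sketch rather than a drop-in
replacement: if the goal is only to flag a 504 that reads 'Includes
index.' or 'Includes indexes.', a single pattern inside grep avoids the
push inside the block and the second loop. The sample data in @fields504
is hypothetical, and the sketch assumes each element is the plain text
of one 504 field. It may behave differently from the workaround in edge
cases, for example when 'index' appears more than once in a field.

```perl
use strict;
use warnings;

# Hypothetical sample data: the text of each 504 field in one record.
my @fields504 = (
    'Includes bibliographical references (p. 200-210) and index.',
    'Includes index.',
);
my @warningstoreturn;

# Flag fields where only a single space separates 'Includes' from
# 'index' (or 'indexes') and a period follows.
my @indexalone = grep { /Includes index(?:es)?\./ } @fields504;
push @warningstoreturn, "504: 'Includes index' should be 500." for @indexalone;

print "$_\n" for @warningstoreturn;
```

With this sample data, only the second field is flagged, so one warning
is reported.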
----------------------------------------------- 
[1] My home page: http://home.inwave.com/eija/

Thank you,

Bryan Baldus
Cataloger
Quality Books, Inc.
The Best of America's Independent Presses
[EMAIL PROTECTED]
http://home.inwave.com/eija/
