I have a technical question about Perl and the MARC 520 field.

Background: My school has an ETD (electronic theses) project, and students fill out an online form before they publish their theses online. The form requires an "Abstract", which is used as the source for the 520 field. We are now writing a Perl program to convert the data from the form into a MARC record automatically.

Problem: At first, the 520 field of the MARC records looked odd in the school library system whenever students started new paragraphs in the online form: there were many extra spaces and empty lines between the paragraphs. Later, our technician wrote the following code to correct it:

    #Capture the last word of the string.
    $abstract =~ m/([a-zA-Z0-9\.]+)[.!?]?\s*$/x;
    $lastword = $1;

    #Remove HTML from abstract
    #(for some reason this can eat the last word of the string)
    $abstract = HTML::FormatText->new->format(parse_html($abstract));
    chomp($abstract);

    #Remove newlines/tabs/extra spaces from abstract
    $abstract =~ s/\s+/ /gs;
    $abstract =~ m/([a-zA-Z0-9\.]+)[.!?]?\s*$/x;
    if ($1 ne $lastword) {
        $abstract .= " $lastword";
    }

The program now handles a short 520 paragraph well. However, if the abstract is very long, some words are cut off at the end of some lines, for example:

"... spe woody..."

The correct text would read "specially fine with woody ... "

Does anyone know how to handle this problem? Thanks a lot for the help!

Best,
Clara

-----Original Message-----
From: Bryan Baldus [mailto:[EMAIL PROTECTED]
Sent: Monday, August 09, 2004 12:04 PM
To: [EMAIL PROTECTED]
Subject: MARC error checking with Perl updates and question

I have once again updated my error checking modules [1], (MARC::)Errorchecks.pm and (MARC::)Lintadditions.pm. I am running out of new things to check for, though I do have a few ideas in mind. These include attempting to find miscoded geographical headings and topical headings (e.g. if "United States" appears in a 6xx subfield other than 651$a or 6xx$z, it may be miscoded (though not always), or if "Dogs" is in 651$a or 6xx$z, it is probably miscoded), as well as the items in the "Current planned in progress tasks" on my site.

I have added a question concerning grep at the end of this message. Thank you for any assistance you may be able to provide.

Changes: (Aug.
8, 2004):

Module updates:

Errorchecks.pm: Version 1.01: Updated July 20-Aug. 7, 2004. Released Aug. 8, 2004.
-Temporary (or not) workaround for check_bk008_vs_bibrefandindex($record) and bibliographies.
-Removed variables from some error messages and cleanup of messages.
-Code readability cleanup.
-Added subroutines:
--check_240ind1vs1xx($record) -- Reports errors based on whether 240 and 1xx are both present and first indicator is 1 or 0.
--check_041vs008lang($record) -- Compares first code in subfield 'a' of 041 vs. 008 bytes 35-37.
--check_5xxendingpunctuation($record) -- Looks for final punctuation in several of the 5xx fields.
--findfloatinghypens($record) -- Looks for space-hyphen-space in each field (in a list of given fields).
--video007vs300vs538($record) -- In video records, compares 007 values vs. 300 and 538 fields. Limited to VHS, DVD, and Video CD.
--ldrvalidate($record) -- Checks for valid bytes in the user-changeable leader bytes.
--geogsubjvs043($record) -- Reports missing 043 if 651 or 6xx$z is present. Has a list of exceptions (e.g. English-speaking countries).
--findemptysubfields($record) -- Looks for empty subfields (e.g. $x$xPsychology.)

Changed subroutines:
-check_bk008_vs_300:
--added cross-checking for codes a, b, c, g (ill., map(s), port(s)., music)
--added checking for 'p. ' or 'v. ' or 'leaves ' in subfield 'a'
--added checking for 'cm.', 'mm.', 'in.' in subfield 'c'
--revised check for 'm', phono. (which QBI doesn't currently use)
--Added check in check_bk008_vs_bibrefandindex($record) for 'Includes index.' (or indexes) in 504
---This has a workaround I would like to figure out how to fix

Lintadditions.pm: Version 1.03: Updated July 20-Aug. 7, 2004. Released Aug. 8, 2004.
-Added check_1xx and check_7xx sets.
-Added checks for non-filing indicator in 130, 630, 730, 740 and 830.
-Added indicator check for 700--ind1 == 3 -> error.
-Added validation of 041 against MARC Code List for Languages.
-Added check_028 and check_037.
-Removed some variables from warning messages.
-Added check_050.
-Added check_040 (IOrQBI specific).
-Added check_440 and check_490.
-Added check_246.
-Changed check_245 ending punctuation errors based on MARC21 rule change vs. LCRI 1.0C from Nov. 2003.
-Added check for square brackets in 245 $h.
-Added check for 260 ending punctuation.

Added and changed scripts:
Most of these are test scripts created while writing the subroutines listed above. The subroutines in the modules may have code not in the scripts, so it is best to use the module rather than the script for those checks (the last 3 full record scripts).

Full record:
-fieldsubfieldcounts.txt -- Field and subfield count--will report totals for each tag and subfield.
--First version: Field tag counts only.
-testnewerrorchecks.txt -- Test script to call new subroutines in Errorchecks.pm (MARC::Errorchecks).
-ldrvalidatescript.txt -- In Errorchecks.pm.
-viddvdvsvhs.txt -- In Errorchecks.pm.
-findemptysubfields.txt -- Looks for empty subfields. Skips 037 in CIP-level records. In Errorchecks.pm.

Cleanup:
-find050doubleperiod.txt -- Test regex for finding pattern in 050$a. Preliminary code for MARC::Lintadditions::check_050()
-removetitlefromlintrpt.txt -- Removes titles from lintallchecks' output file.
-findmissing300apunctuation.txt -- Looks for missing period after p or v in 300a extract file. Initial step for MARC::Errorchecks::check_bk008_vs_300($record) code.

-------------------------------------
Question:
In the following code, is there a more efficient way to write the grep for "Includes index(es)." to get the same result?

### workaround ###
my @indexin504 = grep {$_ =~ /Includes(.*)index(es)?(\.)*/; push @indexalone, $1.$3; $_ =~ /Includes(.*)index(\.)*/;} @fields504;
#look for 'Includes index.' in 504
foreach my $indexalonein504 (@indexalone) {
    #report error if have only space between 'Includes' and 'index' (followed by period)
    if ($indexalonein504 =~ /^ \.$/) {
        push @warningstoreturn, ("504: 'Includes index' should be 500.")
    } #if index is alone in 504
} #foreach index alone

-----------------------------------------------
[1] My home page: http://home.inwave.com/eija/

Thank you,

Bryan Baldus
Cataloger
Quality Books, Inc.
The Best of America's Independent Presses
[EMAIL PROTECTED]
http://home.inwave.com/eija/
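
On the grep question above, here is a minimal sketch of one possible simplification, not a drop-in replacement. It assumes that @fields504 holds the text of each 504 field as a plain string; the sample strings and the name @index_alone are made up for illustration. Instead of capturing pieces inside grep and testing them in a second loop, it lets a single pattern decide whether "index" (or "indexes") directly follows "Includes " and is followed by a period. Note one difference in behavior: the original pushes a value onto @indexalone for every field, even when the pattern did not match, while this version only keeps fields that match. A field containing both "Includes index." and a later occurrence of "index" would also be treated differently, because the original pattern is greedy.

```perl
use strict;
use warnings;

# Hypothetical sample data: each element stands in for the text of one 504 field.
my @fields504 = (
    'Includes bibliographical references (p. 200-210) and index.',
    'Includes index.',
    'Includes indexes.',
    'Includes bibliographical references.',
);

# Keep only the fields where "index" or "indexes" comes directly after
# "Includes " and is followed by a period.
my @index_alone = grep { /Includes index(?:es)?\./ } @fields504;

# One warning per matching field, using the message from the original code.
my @warningstoreturn = map { "504: 'Includes index' should be 500." } @index_alone;

print "$_\n" for @warningstoreturn;
```

With the sample data, only the second and third strings are kept, so two warnings are produced. The non-capturing group (?:es)? avoids creating capture variables that are never used.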