Hi all,

Based on the latest data dump (August 31st), I made VacuumBot clean up
some of the 6791 "formats".

Many formats are badly split MARC title lines (I think because field
delimiters in the MARC records were (partially) missing) and include
the "by statement" (e.g. "[microform] / by Jeffrey C. Hyde") or
subtitle (e.g. "[microform] : European culture studies."). In some
spare time, I used Google Refine to split each of these into a format
plus a by statement or subtitle, and used VacuumBot to update the
records (this task still fits my definition of cleaning).
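
The actual split was done in Google Refine; the Python below is only a
rough sketch of the same idea, assuming the usual MARC/ISBD punctuation
(" / " before the by statement, " : " before the subtitle). The function
name split_format and the returned tuple layout are made up for
illustration.

```python
import re

# " / " introduces the by statement, " : " introduces the subtitle.
# Surrounding whitespace is required so a bare "/" or ":" inside a
# word is not treated as a delimiter.
_DELIMITER = re.compile(r"\s+([/:])\s+")

def split_format(raw):
    """Return (format, field_name, value).

    field_name and value are None when there is nothing to split off.
    Only the first delimiter is used; anything after it stays together.
    """
    match = _DELIMITER.search(raw)
    if match is None:
        return raw.strip(), None, None
    fmt = raw[:match.start()].strip()
    value = raw[match.end():].strip()
    field_name = "by_statement" if match.group(1) == "/" else "subtitle"
    return fmt, field_name, value
```

For example, split_format("[microform] / by Jeffrey C. Hyde") gives
("[microform]", "by_statement", "by Jeffrey C. Hyde").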

If the target field (by_statement or subtitle) already existed and was
not empty, the split-off text went into the notes field instead
(appended if notes was not empty). This is also explained in the edit
comment.
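
The rule above can be sketched roughly as follows. This is not
VacuumBot's code: the record is assumed to be a plain dict, notes is
assumed to be a plain string, and the newline used when appending is a
guess. The name apply_split is hypothetical.

```python
def apply_split(record, field_name, value):
    """Return a copy of record with value stored according to the rule.

    - target field missing or empty: value goes into that field
    - target field already filled: value goes into notes,
      appended on a new line if notes already has content
    """
    updated = dict(record)  # leave the caller's record untouched
    if not updated.get(field_name):
        updated[field_name] = value
    elif updated.get("notes"):
        # Separator is an assumption, not taken from the real edits.
        updated["notes"] = updated["notes"] + "\n" + value
    else:
        updated["notes"] = value
    return updated
```
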
There may be bad data left. I didn't check extensively for partially
missing field delimiters; for example, some subtitles may start with
"b ".

I also changed many other formats, although those include simple
normalizations such as changing "eBook" to "E-book" and even "10 cm."
to "10 cm". All in all, the number of different "formats" is down to
about 4500.
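
A minimal sketch of that kind of normalization, using only the two
examples mentioned here. The mapping table and the name
normalize_format are hypothetical; the real cleanup covered many more
cases.

```python
import re

# Hypothetical spelling table, keyed by lowercased input.
_SPELLINGS = {
    "ebook": "E-book",
    "e-book": "E-book",
}

# A size in centimetres followed by a trailing period, e.g. "10 cm."
_CM_WITH_PERIOD = re.compile(r"^(\d+(?:\.\d+)?\s*cm)\.$")

def normalize_format(fmt):
    """Return a cleaned-up format string; unknown values pass through."""
    cleaned = fmt.strip()
    cleaned = _CM_WITH_PERIOD.sub(r"\1", cleaned)  # "10 cm." -> "10 cm"
    return _SPELLINGS.get(cleaned.lower(), cleaned)
```

Counting the distinct values before and after, for instance with
len({normalize_format(f) for f in formats}), is one way to see how far
the number of different formats drops.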

Ben
_______________________________________________
Ol-tech mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
To unsubscribe from this mailing list, send email to 
[email protected]
