Ben, I love the "down to 4500" :-) Thanks, though, for the work. You are right about the bad data -- I looked at some and it was simply badly delineated in the input records.
I honestly would love to see the db rebuilt without some of the really bad data sources that got in early on. Right now the quality problems are making it hard to use the OL. Then again, there's always the tension between quality and quantity (breadth).

kc

On 9/17/12 12:36 PM, Ben Companjen wrote:
> Hi all,
>
> Based on the latest data dump (August 31st), I made VacuumBot clean up
> some of the 6791 "formats".
>
> Many formats are badly split MARC title lines (I think because field
> delimiters in the MARC records were (partially) missing) and include
> the "by statement" (e.g. "[microform] / by Jeffrey C. Hyde") or
> subtitle (e.g. "[microform] : European culture studies."). In some
> spare time, I used Google Refine to split these formats to a format
> and by statement or subtitle and used VacuumBot to update the records
> (this task still fits with my definition of cleaning).
>
> If the field (by_statement or subtitle) already existed and was not
> empty, the content was put in (if not empty, then added to) the notes
> field. This is also explained in the edit comment.
> There may be bad data left. I didn't extensively check for partially
> missing field delimiters, for example some subtitles may start with "b
> ".
>
> I also changed many other formats, although those include changing
> "eBook" to "E-book" and even "10 cm." to "10 cm". All in all, the
> number of different "formats" is down to about 4500.
>
> Ben
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
> [email protected]
>

--
Karen Coyle
[email protected] http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
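
For anyone who wants to try the same kind of split on their own copy of the dump, here is a rough sketch of the rule Ben describes in the quoted message: a "format" that contains " / " carries a by statement, one that contains " : " carries a subtitle, and the split-off text goes to the notes field when the target field is already filled. This is not VacuumBot's or Google Refine's actual code. The function names and the `physical_format` key are my assumptions for illustration; only `by_statement`, `subtitle` and the notes field are named in the thread.

```python
import re

# Punctuation as it appears in the two examples from the thread:
# " / " introduces a by statement, " : " introduces a subtitle.
_SPLIT = re.compile(r"^(?P<format>.*?)\s+(?P<mark>[/:])\s+(?P<rest>.+)$")


def split_format(value):
    """Return (format, field_name, field_value).

    field_name and field_value are None when there is nothing to split off.
    """
    value = value.strip()
    m = _SPLIT.match(value)
    if not m:
        return value, None, None
    field = "by_statement" if m.group("mark") == "/" else "subtitle"
    return m.group("format"), field, m.group("rest")


def clean_record(record):
    """Return a cleaned copy of a hypothetical edition dict.

    Assumption: the format lives under the key "physical_format".
    """
    rec = dict(record)
    fmt, field, extra = split_format(rec.get("physical_format", ""))
    if field is None:
        return rec
    rec["physical_format"] = fmt
    if not rec.get(field):
        # Target field missing or empty: move the text there.
        rec[field] = extra
    else:
        # Target field already filled: keep it, append the text to notes.
        notes = rec.get("notes", "")
        rec["notes"] = f"{notes}\n{extra}" if notes else extra
    return rec
```

Values with no " / " or " : " in them, such as "E-book" or "10 cm.", pass through unchanged, so the other normalisations mentioned in the quoted message would still need their own rules. The sketch also does nothing about the partially missing field delimiters (subtitles starting with "b ").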
