Ben, I love the "down to 4500" :-) Thanks, though, for the work. You are 
right about the bad data -- I looked at some and it was simply badly 
delineated in the input records.

I honestly would love to see the db rebuilt without some of the really 
bad data sources that got in early on. Right now the quality problems 
are making it hard to use the OL. Then again, there's always the tension 
between quality and quantity (breadth).

kc

On 9/17/12 12:36 PM, Ben Companjen wrote:
> Hi all,
>
> Based on the latest data dump (August 31st), I made VacuumBot clean up
> some of the 6791 "formats".
>
> Many formats are badly split MARC title lines (I think because field
> delimiters in the MARC records were (partially) missing) and include
> the "by statement" (e.g. "[microform] / by Jeffrey C. Hyde") or
> subtitle (e.g. "[microform] : European culture studies."). In some
> spare time, I used Google Refine to split these formats into a format
> and a by statement or subtitle, and used VacuumBot to update the records
> (this task still fits my definition of cleaning).
>
> If the target field (by_statement or subtitle) already existed and was
> not empty, the content was put in the notes field instead (appended if
> the notes field was not empty). This is also explained in the edit
> comment.
> There may be bad data left. I didn't extensively check for partially
> missing field delimiters; for example, some subtitles may start with "b
> ".
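The split and the notes fallback described above could look roughly like the sketch below. This is not VacuumBot's actual code: the function name split_format, the field name physical_format, and the treatment of notes as a plain string are all assumptions made for illustration. Only by_statement, subtitle, and the two sample strings come from the message.

```python
import re

# Hypothetical sketch, not VacuumBot's real implementation.
# Splits at the first " / " (by statement) or " : " (subtitle).
SPLIT_RE = re.compile(r"^(?P<format>.*?)\s+(?P<sep>[/:])\s+(?P<rest>.+)$")


def split_format(record):
    """Return a copy of record with a by statement or subtitle split out
    of the (assumed) 'physical_format' field."""
    fixed = dict(record)
    match = SPLIT_RE.match(record.get("physical_format") or "")
    if match is None:
        return fixed  # nothing to split, e.g. "10 cm"

    fixed["physical_format"] = match.group("format")
    target = "by_statement" if match.group("sep") == "/" else "subtitle"
    rest = match.group("rest").strip()

    if fixed.get(target):
        # Target field already has content: keep it and move the
        # split-off text to notes (assumed here to be a plain string).
        notes = fixed.get("notes") or ""
        fixed["notes"] = f"{notes}\n{rest}" if notes else rest
    else:
        fixed[target] = rest
    return fixed


if __name__ == "__main__":
    print(split_format({"physical_format": "[microform] / by Jeffrey C. Hyde"}))
```

A format without a " / " or " : " separator is returned unchanged, so values such as "E-book" pass through untouched.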
>
> I also changed many other formats, although those changes include
> minor ones such as "eBook" to "E-book" and even "10 cm." to "10 cm".
> All in all, the number of different "formats" is down to about 4500.
>
> Ben
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to 
> [email protected]
>

-- 
Karen Coyle
[email protected] http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
