Hi John,

Indeed, it was the ISBN :) The Amazon page had the correct number of pages (247). I'm not sure Amazon can always be trusted though, as there appear to be paperbacks with 9000 pages in 2.8 inches as well, like <http://openlibrary.org/books/OL11001506M/Ria_Internal_Revenue_Code_August_2005>.
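Off the top of my head, a check along these lines could flag this kind of mix-up in bulk - an untested sketch, assuming the edition JSON carries the usual "number_of_pages" and "isbn_10" fields:

import re

def pages_match_isbn(edition):
    """Return True if number_of_pages equals the numeric part of an ISBN-10."""
    pages = edition.get("number_of_pages")
    if not isinstance(pages, int):
        return False
    for isbn in edition.get("isbn_10", []):
        digits = re.sub(r"[^0-9]", "", isbn)  # strip hyphens and a trailing X
        if digits and int(digits) == pages:
            return True
    return False

You have a point that this only catches one kind of outlier, but it's cheap to run while converting the dump.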
I updated the script so that it exports the weight in the csv too.

On Feb 23, 2013 8:40 AM, "John Shutt" <[email protected]> wrote:
>
> Hey Ben,
>
> I haven't looked into book dimensions before, but I noticed the book you
> linked to seems to have its ISBN-10 listed as its number of pages. But I
> doubt that looking at ID numbers would be much help in cleaning up the
> rest of the outliers, unless they're using partial numbers pulled from
> OLIDs.
>
> John
>
> On Fri, Feb 22, 2013 at 3:59 PM, Ben Companjen <[email protected]> wrote:
>>
>> Hi all,
>>
>> This week I started experimenting a little with book dimensions. In
>> light of the machine learning ideas, it should be easy to predict one
>> physical dimension from the others. (Spam entries can easily be
>> detected by looking at dimensions, as only numbers are expected -
>> although I've seen kilobytes too.)
>>
>> In the edit form there are fields for height, width and depth in
>> several units, number of pages, format and weight (again in several
>> units), and you'd think some of these depend on some or all of the
>> others:
>> - number of pages + format -> depth
>> - format + height + width + depth -> weight
>> etc.
>> There is more to it (covers come in different thicknesses, and so does
>> paper), but it may be a fun experiment.
>>
>> In the initial data conversion (I just pushed edition2csv.py to [1]), I
>> of course tripped multiple times over the myriad of entry formats found
>> in the "physical_dimensions" field of edition records. With some
>> regular expressions I could filter out the strings that were not
>> "[h] x [w] x [d] inches|centimeters". I loaded the matching records
>> into R, tried plotting number of pages vs depth, and got really funky
>> results. Apparently some paperbacks that are less than an inch thick
>> have 200000+ pages...
>>
>> This one had the most pages, by far:
>> http://openlibrary.org/books/OL8995338M?v=3
>>
>> The attached plot shows 1000 paperbacks, outliers included.
>>
>> Were there ever any attempts at using or cleaning up this data?
>> Did this spark someone's imagination? ;)
>>
>> Ben
>>
>> [1] https://github.com/bencomp/oldumpscripts/blob/dev/edition2csv.py ;
>> it converts parts of edition records to a predictable csv file.
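P.S. For anyone curious, the conversion boils down to something like the sketch below. This is a simplified illustration, not the actual edition2csv.py linked above: it assumes one JSON edition record per line on stdin, keeps only "physical_dimensions" values of the form "[h] x [w] x [d] inches|centimeters", and writes one CSV row per edition, weight included. Apart from "physical_dimensions", the field names ("number_of_pages", "physical_format", "weight") are my reading of the edition records.

import csv
import json
import re
import sys

# Match e.g. "7.8 x 5.1 x 0.8 inches" or "20 x 13 x 2 centimeters".
DIM_RE = re.compile(
    r"^\s*([\d.]+)\s*x\s*([\d.]+)\s*x\s*([\d.]+)\s*(inches|centimeters)\s*$",
    re.IGNORECASE)

def edition_to_row(edition):
    """Return (height, width, depth, unit, pages, format, weight) or None."""
    m = DIM_RE.match(edition.get("physical_dimensions", ""))
    if not m:
        return None  # skip the myriad of other entry formats
    height, width, depth = (float(g) for g in m.groups()[:3])
    return (height, width, depth, m.group(4).lower(),
            edition.get("number_of_pages", ""),
            edition.get("physical_format", ""),
            edition.get("weight", ""))

writer = csv.writer(sys.stdout)
writer.writerow(["height", "width", "depth", "unit",
                 "pages", "format", "weight"])
for line in sys.stdin:
    row = edition_to_row(json.loads(line))
    if row is not None:
        writer.writerow(row)

Loading the resulting CSV into R then makes the pages-vs-depth plot a one-liner, outliers and all.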
