How about adding a disclaimer line to the webdatacommons.org site? Something like: "Note that many database-backed sites contain a huge long tail of rarely-visited, rarely-linked pages (e.g. product catalogues), which increasingly contain useful structured data. It is best not to assume that this collection contains a complete, deep crawl of every site it touches."
Dan
