Hi Ben,

Thanks, your file is more readable than the JSON file I produced myself. But the cdump includes every revision of every record; that's why, for these statistics, I use the latest datadump, which contains only the latest revision of each record.
I recently wrote a script that transforms the JSON records into CSV rows (4-tuples of key, attribute, value type and value) and which includes all values of all fields. I ran it on the July dump and it produced about 850 million rows :)

Ben (C)

On 21 August 2012 19:51, Ben Vizzier <[email protected]> wrote:
> I have no idea if this will be of any use to anyone...
>
> I was curious about what a profile of the records to fields in the cdump file looks like, so I wrote a program that parses the JSON in the cdump records. I used ol_cdump_2012-07-31.txt.gz to generate this Excel spreadsheet.
>
> ** It only looks at the top-level fields, not the subfields. So, for example, the "authors" field contains key-value pairs with "key" fields. The total record count in the table for the field "key" DOES NOT include the "key" subfield from the "authors" field. Thus, the number of "key" fields counted is 119,716,115, which is the number of records in ol_cdump_2012-07-31.txt.gz.
>
> The total column and row are simple sanity checks to see if there are any outliers, such as the " _date" field that occurs in only one author record.
>
> BenV
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to [email protected]
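In case anyone wants to try something similar, the JSON-to-CSV transformation I describe above can be sketched roughly like this. Note this is only an illustration, not my actual script: the assumption that the record JSON sits in the last tab-separated column of each dump line, and the choice to re-serialize nested values as JSON strings, are mine here.

```python
import csv
import json

def flatten(key, record, writer):
    """Write one (key, attribute, value type, value) CSV row per top-level field."""
    for attr, value in record.items():
        vtype = type(value).__name__
        if isinstance(value, (list, dict)):
            # Nested structures are re-serialized so every row stays a flat 4-tuple.
            value = json.dumps(value)
        writer.writerow([key, attr, vtype, value])

def dump_to_csv(dump_lines, out):
    """Flatten dump lines to CSV; assumes the JSON is the last tab-separated column."""
    writer = csv.writer(out)
    for line in dump_lines:
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        flatten(record.get("key", ""), record, writer)
```

Fed the full dump line by line, this emits one row per top-level field of every record, which is how the output quickly reaches hundreds of millions of rows.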
