Hi Ben,

Thanks, your file is more readable than the JSON file I produced myself.
But the cdump includes every revision of everything; that's why, for these
statistics, I use the latest data dump, which contains only the latest
revision of each record.

I recently wrote a script that transforms the JSON records into CSV rows
(4-tuples of key, attribute, value type, and value), covering all values of
all fields. I ran it on the July dump and it produced about 850 million
rows :)
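Purely as illustration, the flattening described above might look roughly like this (a minimal sketch; the dotted attribute paths for nested fields and the type labels are my assumptions, not necessarily what the actual script emits):

```python
import json

def flatten(key, obj, attr=""):
    """Yield (key, attribute, value type, value) rows for one record.

    Nested dicts and lists are walked recursively; the attribute column
    is a dotted/indexed path (an assumed naming scheme for this sketch).
    """
    if isinstance(obj, dict):
        for name, value in obj.items():
            yield from flatten(key, value, f"{attr}.{name}" if attr else name)
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(key, value, f"{attr}[{i}]")
    else:
        yield (key, attr, type(obj).__name__, obj)

record = json.loads('{"key": "/books/OL1M", "title": "Example", '
                    '"authors": [{"key": "/authors/OL1A"}]}')
rows = list(flatten(record["key"], record))
```

For the example record this yields three rows, one per leaf value, with the nested author key appearing as `authors[0].key`.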

Ben (C)

On 21 August 2012 19:51, Ben Vizzier <[email protected]> wrote:
> I have no idea if this will be of any use to anyone...
>
> I was curious about what a profile of the records to fields in the cdump
file looks like, so I wrote a program that parsed the json in the cdump
records. I used ol_cdump_2012-07-31.txt.gz to generate this Excel
spreadsheet.
>
> ** It only looks at the top level fields and not the subfields. So, for
example, the "authors" field contains key-value-pairs with "key" fields.
The total record counts in the table for the field "key" DO NOT include
the "key" subfield from the authors field. Thus, the number of key fields
counted is 119,716,115, which is the number of records in
ol_cdump_2012-07-31.txt.gz.
>
> The total column and row are simple sanity checks to see if there are any
outliers, such as the "_date" field, which occurs in only one author record.
>
> BenV
>
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to
[email protected]
>