Hi Lars,

On 19-5-2010 13:45, Lars Aronsson wrote:
> Wikipedia was created in 2001 and the image bank Wikimedia Commons
> a few years later. It now contains 6 million files, mostly images.
> Most of them use the template:Information which has a Date= field
> to indicate when the content was created. The ideal format is the
> ISO date format YYYY-MM-DD, but this is not always followed. When
> I tried to parse the year, I was successful for 3.5 million files.
> (Maybe I didn't try very hard.)

I guess you used a regex. Which one exactly? Or did you publish your
code somewhere?

> So, when were our files created? Of course, most were created
> after Wikipedia was founded, in the most recent decade.
> Even for old buildings, new photos were taken and uploaded.
>
> For older decades, we should expect more information for more
> recent ones, since more cameras were in use and more books
> published with each new decade. Exactly how big has that
> growth rate been?
>
> It turns out, we have roughly 2% more files for each new year.
> A graph plotting each year is very bumpy, but if we sum up each
> decade, it becomes quite smooth. This does not mean that content
> production increased by 2% annually, but the content that
> survived and was copied to Wikimedia Commons has grown this fast.
>
> But this is only true for the years between 1750 and 1900.
>
> For years before 1750, before the Enlightenment, the growth rate
> is only 0.5 percent annually. Also quite reasonable.
>
> The real surprise is that after 1900, there is no growth.
> We have roughly 30,000 files from each decade in the
> 20th century.
> These are the numbers I found:
>
> 1850s 8652 files
> 1860s 12144
> 1870s 16561
> 1880s 19382
> 1890s 25985
> 1900s 37936
> 1910s 34882
> 1920s 23715
> 1930s 24507
> 1940s 30720
> 1950s 29364
> 1960s 24164
> 1970s 23991
> 1980s 31185
> 1990s 45423
> 2000s 2,951,138 files
>
> And the graph is found on
> http://commons.wikimedia.org/wiki/File:Wikimedia_Commons_files_per_decade.png
>
> My guess is that this is an effect of copyright laws,
> which locks down the use of 20th century content.
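
As a rough cross-check of the growth figures quoted above, the decade counts can be turned into an implied compound annual growth rate. This is only a sketch under stated assumptions: the function name `annual_growth` is made up for illustration, each decade is treated as a single point ten years from the next, and only the decades listed in the quoted table are used (the counts before 1850 are not given there).

```python
# Decade counts copied from the quoted table (key = first year of the decade).
counts = {
    1850: 8652, 1860: 12144, 1870: 16561, 1880: 19382, 1890: 25985,
    1900: 37936, 1910: 34882, 1920: 23715, 1930: 24507, 1940: 30720,
    1950: 29364, 1960: 24164, 1970: 23991, 1980: 31185, 1990: 45423,
}


def annual_growth(counts, first, last):
    """Implied compound annual growth rate between two decades.

    Assumption: the two decade totals are treated as points that are
    (last - first) years apart; this is a simplification.
    """
    years = last - first
    return (counts[last] / counts[first]) ** (1.0 / years) - 1.0


print(f"1850s to 1890s: {annual_growth(counts, 1850, 1890):.1%} per year")
print(f"1900s to 1980s: {annual_growth(counts, 1900, 1980):.1%} per year")
```

With the quoted numbers this gives about 2.8% per year for the 1850s to 1890s, the same order of magnitude as the roughly 2% reported for 1750 to 1900 as a whole, and about -0.2% per year for the 1900s to 1980s, which is essentially flat.
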
Nice stats! I wonder what the distribution of years looks like for the
images from the batch uploads, and whether this influences the overall
statistics.

Do you have a list of the 2.5 M files you couldn't parse? We might be
able to add dates to some of these images or convert the dates to the
ISO format.

Maarten

_______________________________________________
Commons-l mailing list
https://lists.wikimedia.org/mailman/listinfo/commons-l
