Thanks, Collin, for your kind words! I just uploaded the "rulebook" as a Google Spreadsheet at http://goo.gl/zKslcu. Feel free to use it however you like, but please read the paper (http://goo.gl/RnMvG1) first to avoid misinterpretation.
Best,

On Tue, Oct 1, 2013 at 12:45 PM, Collin Anderson <[email protected]> wrote:

> Congratulations, this is impressive work. I am also completely jealous --
> a colleague and I will be releasing a similar report for Iran in the
> next two weeks. This is intended at a broader global project on Wikipedia
> censorship ({{Citation Filtered}}) that I would hope might merge well into
> what you are doing.
>
>
> On Mon, Sep 30, 2013 at 7:26 PM, 夏楚 <[email protected]> wrote:
>
>> To all,
>>
>> I just finished writing up my research on the GFW (Great Firewall of China)
>> blacklist for Wikipedia. Some of you might find it interesting.
>>
>> The paper can be found at goo.gl/RnMvG1 (tweeted
>> here <https://twitter.com/SummerAgony/status/384820318402920448>).
>> I paste excerpts from the Abstract and Conclusions below.
>>
>> *Abstract*
>>
>> In this report, we detail the *complete* and *exact* rulebook that the
>> Great Firewall of China (GFW) exerts on Wikipedia. We call it "rulebook"
>> (instead of the common term "blacklist") because we not only identify the
>> blacklisted terms, but also the exact string matching rules deployed by
>> GFW. An efficient probing methodology makes this possible.
>>
>> ...
>> Wikipedia contains millions of pages, e.g. more than 700,000 articles for
>> the Chinese version, and more than 4,240,000 articles for the English
>> version. It seems a daunting and infeasible task to test these pages
>> exhaustively, hence there has been no well-known attempt to gather the
>> complete blacklist.
>>
>> While a small sample of the blacklist is useful, the complete picture
>> can be much more powerful in revealing the underlying workings of GFW and
>> its operators. In this study, we devised a methodology which efficiently
>> examines the entire Wikipedia corpus, hence exposing to the world the
>> complete GFW rulebook for Wikipedia for the first time. In total, there
>> are 919 rules (excluding URL terms) which are applicable to Wikipedia,
>> affecting 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>>
>> The revealed rulebook also demonstrates that the GFW operation is
>> haphazard and ill-maintained. At the same time, the Chinese
>> censorship bureaucracy *intends* to be thorough and extensive.
>>
>> To be precise, the findings in this report are on two Wikipedia
>> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
>> English version.
>>
>> *Conclusion Remarks*
>>
>> In this study, we examined the entire Wikipedia corpus (Chinese version
>> and English version) and revealed the complete and exact GFW rulebook for
>> Wikipedia (with caveats described in Section 6).
>>
>> A sample of notable findings:
>>
>> - There are 78 terms for which GFW blocks a non-standard variant but
>>   not the canonical path. These are cases the censors intend to block but
>>   the block does not really happen, suggesting the censors have a poor
>>   understanding of Wikipedia's serving system.
>> - Many obscure non-article pages are blocked, which raises suspicion
>>   that these pages were provided to the censorship bureaucrats by Wikipedia
>>   editors who are very familiar with the content (e.g. those who
>>   participated in the edit wars and/or discussions regarding
>>   self-censorship proposals).
>> - GFW string matching rules have a hard size limit of 64 bytes.
>>
>> The biggest lesson from this study, in my opinion, is that GFW operation
>> is haphazard and ill-maintained. Also, there are many indications that the
>> GFW operators are somewhat disconnected from the censorship bureaucrats.
>>
>> We hope these findings can be of interest to internet censorship watchers,
>> Wikipedia researchers, China observers, and ordinary Chinese citizens.
>>
>>
>> --
>> Xia Chu (Twitter: @summer.agony)
>>
>> --
>> Liberationtech is public & archives are searchable on Google. Violations
>> of list guidelines will get you moderated:
>> https://mailman.stanford.edu/mailman/listinfo/liberationtech.
>> Unsubscribe, change to digest, or change password by emailing moderator at
>> [email protected].
>>
>
>
>
> --
> *Collin David Anderson*
> averysmallbird.com | @cda | Washington, D.C.
>

--
Xia Chu (Twitter: @summer.agony; Google+: gplus.to/summer.agony)
