Congratulations, this is impressive work. I am also completely jealous -- a
colleague and I will be releasing a similar report for Iran in the
next two weeks. Ours is intended as part of a broader global project on Wikipedia
censorship ({{Citation Filtered}}) that I hope might merge well with
what you are doing.

On Mon, Sep 30, 2013 at 7:26 PM, 夏楚 <[email protected]> wrote:

> To all,
>
> I just finished writing up my research on GFW (Great Firewall of China)
> blacklist for Wikipedia. Some of you might find it interesting.
>
> The paper can be found at goo.gl/RnMvG1 (tweeted
> here<https://twitter.com/SummerAgony/status/384820318402920448>).
> Here I paste excerpts from the Abstract and Conclusions below.
>
> *Abstract*
>
> In this report, we detail the *complete* and *exact* rulebook that the
> Great Firewall of China (GFW) exerts on Wikipedia. We call it "rulebook"
> (instead of the common term "blacklist") because we not only identify the
> blacklisted terms, but also the exact string matching rules deployed by
> GFW. An efficient probing methodology makes this possible.
>
> ...
> Wikipedia contains millions of pages, e.g. more than 700,000 articles for
> the Chinese version, and more than 4,240,000 articles for the English
> version. It seems a daunting and unfeasible task to test these pages
> exhaustively, hence there has been no well known attempt to gather the
> complete blacklist.
>
> While a small sample of the blacklist is useful, the complete picture
> can be much more powerful in revealing the underlying works of GFW and
> its operators. In this study, we devised a methodology which efficiently
> examines the entire Wikipedia corpus, hence exposing to the world the
> complete GFW rulebook for Wikipedia the first time. In total, there are 919
> rules (excluding URL terms) which are applicable to Wikipedia, affecting
> 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>
> The revealed rulebook also demonstrates that the GFW operation is
> haphazard and ill-maintained. At the same time, Chinese
> censorship bureaucracy *intends* to be thorough and extensive.
>
> To be precise, the findings in this report are on two Wikipedia
> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
> English version.
>
> *Conclusion Remarks*
>
> In this study, we examined the entire Wikipedia corpus (Chinese version
> and English version) and revealed the complete and exact GFW rulebook for
> Wikipedia (with caveats described in Section 6).
>
> A sample of notable findings are:
>
>    - There are 78 terms for which GFW blocks a non-standard variant but
>      not the canonical path. These are cases the censors intend to block but the
>      block does not really happen, suggesting the censors have poor
>      understanding of Wikipedia's serving system.
>    - Many obscure non-article pages are blocked, which raises suspicion
>      that these pages were provided to the censorship bureaucrats by Wikipedia
>      editors who are very familiar with the content (e.g. those who participated
>      in the edit wars and/or discussions regarding self-censorship proposals).
>    - GFW string matching rules have a 64-byte hard limit of size.
>
> The biggest learning out of this study, in my opinion, is that GFW operation
> is haphazard and ill-maintained. Also, there are many indications that the
> GFW operators are somewhat disconnected from the censorship bureaucrats.
>
> We hope the revealing can be of interest to internet censorship watchers,
> Wikipedia researchers, China observers, and ordinary Chinese citizens.
>
>
> --
> Xia Chu (Twitter: @summer.agony)
>
> --
> Liberationtech is public & archives are searchable on Google. Violations
> of list guidelines will get you moderated:
> https://mailman.stanford.edu/mailman/listinfo/liberationtech.
> Unsubscribe, change to digest, or change password by emailing moderator at
> [email protected].
>

--
*Collin David Anderson*
averysmallbird.com | @cda | Washington, D.C.
