To all, Happy Holidays!
I just published Version 3.0 of my GFW research. First of all, I created a "master spreadsheet" for all the findings and updates at http://goo.gl/zKslcu. It contains links to the papers and various lists. I also tweeted it here <https://twitter.com/SummerAgony/status/416102422826602496>.

There are several major additions in this version (V3.0 is located at http://goo.gl/u971J7):

1. I created a monitoring pipeline which monitors GFW's updates on Wikipedia. (For updates, one can subscribe to the mailing list [email protected].)
2. I applied the methodology to four more areas:
   A. I examined more than 1 million website names (obtained from Alexa and several online lists: greatfire, autoproxy) and identified 3644 GFW filtering rules targeting website names. This list is significantly more comprehensive and more precise than any previous list.
   B. I applied the methodology to IMDB, examined 4M titles and identified 6 GFW rules.
   C. I examined a big repository of AppStore apps (648,567 items) and identified 26 GFW rules.
   D. I checked 786,432 IP strings and identified 130 GFW rules.
3. I discovered 9 new rules (deployed after 2013-10-01) against Wikipedia.

For readers who have seen V2.0 of the paper, the new sections are Section 9 (websites), Section 10 (IP strings), Section 11 (IMDB), Section 12 (AppStore) and Appendix C (list of the 3644 websites).

Again, this research is a solo project done in my spare time, and people's feedback is greatly appreciated. In particular, if you know of a large corpus that GFW may filter, I'd love that input. For example, I only examined 1M website names and ~60% of AppStore apps here. If you have a bigger collection of website names, or a way to get the full AppStore list, I'd love to take a look.

Last but not least, as I mentioned in the paper, this study was originally motivated by Dr Xu Zhiyong (wiki page <https://en.wikipedia.org/wiki/Xu_Zhiyong>, news search <https://www.google.com/search?q=xu+zhiyong&tbm=nws>), whose Chinese Wikipedia page <http://zh.wikipedia.org/wiki/%E8%AE%B8%E5%BF%97%E6%B0%B8> is (surprisingly) accessible in China (it turned out that GFW blocked a non-standard variant of the page). Dr Xu is currently facing trial in Beijing and may be sentenced to several years in prison for his peaceful efforts to make China a place with a little bit more freedom, righteousness and love. China's New Citizens' Movement <https://en.wikipedia.org/wiki/New_Citizens%27_Movement_%28China%29> needs more support from the world!!

Best,
Xia Chu

On Fri, Oct 18, 2013 at 6:20 PM, 夏楚 <[email protected]> wrote:

> To all,
>
> I just wrote up my new study of GFW and it is available at
> http://goo.gl/KfBCgT
>
> In this new version, I thoroughly studied GFW's HTTP response filtering
> scheme, which has not been well studied in the past. The bulk of the new
> results is in Section 5 (pp 8-12). The following are some excerpts
> regarding the new findings.
>
> *Abstract*
>
> In Version 2.0, we studied GFW's filtering rules for HTTP responses
> extensively and identified a comprehensive list (including those affecting
> Wikipedia and beyond). This list is small (19 items) but the rules affect
> many more pages on Wikipedia and other websites.
>
> *Section 5.3 Learnings and Mysteries of GFW's HTTP Response Filtering*
>
> - GFW's HTTP request filtering and response filtering are two separate
>   systems. First, their filtering rules are entirely different. Second,
>   GFW's HTTP request filtering is homogeneous and has a near-perfect
>   trigger rate, but GFW's HTTP response filtering varies hugely, not only
>   in the triggering rates, but also in the filtering rules in effect.
>   For example, CERNET (China Education and Research Network) seems to have
>   all the rules in place, but some other ISPs only have a subset.
>
> - One remarkable finding is that GFW does not just look at individual
>   TCP packets; instead, it "remembers" the entire TCP session to look
>   for offenders. This becomes evident when the filtering rule is
>   "$term_A & $term_B" and the two terms show up far apart (hundreds of
>   thousands of bytes from each other) on a webpage: GFW is still able to
>   reset the connection. Achieving this requires significant investment in
>   infrastructure, and it is probably also the reason why the rulebook is so
>   much smaller for HTTP response filtering than for HTTP request filtering.
>
> Best,
>
> On Mon, Sep 30, 2013 at 4:26 PM, 夏楚 <[email protected]> wrote:
>
>> To all,
>>
>> I just finished writing up my research on the GFW (Great Firewall of
>> China) blacklist for Wikipedia. Some of you might find it interesting.
>>
>> The paper can be found at goo.gl/RnMvG1 (tweeted here
>> <https://twitter.com/SummerAgony/status/384820318402920448>).
>> Here I paste excerpts from the Abstract and Conclusions below.
>>
>> *Abstract*
>>
>> In this report, we detail the *complete* and *exact* rulebook that the
>> Great Firewall of China (GFW) applies to Wikipedia. We call it a "rulebook"
>> (instead of the common term "blacklist") because we identify not only the
>> blacklisted terms, but also the exact string matching rules deployed by
>> GFW. An efficient probing methodology makes this possible.
>>
>> ...
>> Wikipedia contains millions of pages, e.g. more than 700,000 articles for
>> the Chinese version, and more than 4,240,000 articles for the English
>> version. It seems a daunting and infeasible task to test these pages
>> exhaustively, hence there has been no well-known attempt to gather the
>> complete blacklist.
>>
>> While a small sample of the blacklist is useful, the complete picture
>> can be much more powerful in revealing the inner workings of GFW and
>> its operators. In this study, we devised a methodology which efficiently
>> examines the entire Wikipedia corpus, hence exposing to the world the
>> complete GFW rulebook for Wikipedia for the first time. In total, there
>> are 919 rules (excluding URL terms) which are applicable to Wikipedia,
>> affecting 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>>
>> The revealed rulebook also demonstrates that the GFW operation is
>> haphazard and ill-maintained. At the same time, the Chinese
>> censorship bureaucracy *intends* to be thorough and extensive.
>>
>> To be precise, the findings in this report are based on two Wikipedia
>> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
>> English version.
>>
>> *Conclusion Remarks*
>>
>> In this study, we examined the entire Wikipedia corpus (Chinese version
>> and English version) and revealed the complete and exact GFW rulebook for
>> Wikipedia (with caveats described in Section 6).
>>
>> A sample of notable findings:
>>
>> - There are 78 terms for which GFW blocks a non-standard variant but
>>   not the canonical path. These are cases where the censors intend to
>>   block a page but the block does not really happen, suggesting the
>>   censors have a poor understanding of Wikipedia's serving system.
>> - Many obscure non-article pages are blocked, which raises the suspicion
>>   that these pages were provided to the censorship bureaucrats by
>>   Wikipedia editors who are very familiar with the content (e.g. those
>>   who participated in the edit wars and/or discussions regarding
>>   self-censorship proposals).
>> - GFW string matching rules have a hard size limit of 64 bytes.
>>
>> The biggest lesson from this study, in my opinion, is that the GFW
>> operation is haphazard and ill-maintained.
>> Also, there are many indications that the GFW operators are somewhat
>> disconnected from the censorship bureaucrats.
>>
>> We hope these findings will be of interest to internet censorship
>> watchers, Wikipedia researchers, China observers, and ordinary Chinese
>> citizens.
>>
>> --
>> Xia Chu (Twitter: @summer.agony)
>
> --
> Xia Chu (Twitter: @summer.agony)

--
Xia Chu (Twitter: @summer.agony; Google+: gplus.to/summer.agony)
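
The quoted Section 5.3 excerpt describes GFW matching a rule of the form "$term_A & $term_B" across a whole TCP session rather than within a single packet. The following is a minimal sketch of that idea only, not GFW's actual implementation and not code from the paper. The names `ConjunctionRule`, `SessionMatcher`, and `per_packet_match` are hypothetical, and the terms are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ConjunctionRule:
    """Hypothetical rule: fires only when every term has been seen."""
    terms: tuple


@dataclass
class SessionMatcher:
    """Keeps per-session state so terms may arrive in different segments."""
    rule: ConjunctionRule
    seen: set = field(default_factory=set)
    tail: bytes = b""

    def feed(self, segment: bytes) -> bool:
        # Prepend the end of the previous data so that a term split across
        # two segments is still found.
        window = self.tail + segment
        for term in self.rule.terms:
            if term in window:
                self.seen.add(term)
        # Remember just enough bytes to complete the longest term later.
        keep = max(len(t) for t in self.rule.terms) - 1
        self.tail = window[-keep:] if keep > 0 else b""
        return self.seen == set(self.rule.terms)


def per_packet_match(segments, rule: ConjunctionRule) -> bool:
    """Stateless contrast: every term must appear inside one segment."""
    return any(all(t in seg for t in rule.terms) for seg in segments)


rule = ConjunctionRule(terms=(b"term_A", b"term_B"))
# The two terms are about 200,000 bytes apart, and term_A is split
# across the first two segments.
segments = [b"...term_", b"A" + b"x" * 200_000, b"term_B..."]

matcher = SessionMatcher(rule)
session_results = [matcher.feed(s) for s in segments]
print(session_results)                    # [False, False, True]
print(per_packet_match(segments, rule))   # False
```

The session-level matcher has to hold state for every open connection, which is consistent with the excerpt's remark that this kind of filtering requires significant infrastructure.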
-- Liberationtech is public & archives are searchable on Google. Violations of list guidelines will get you moderated: https://mailman.stanford.edu/mailman/listinfo/liberationtech. Unsubscribe, change to digest, or change password by emailing moderator at [email protected].
