Author: Carl Friedrich Bolz-Tereick <[email protected]> Branch: extradoc Changeset: r5963:73fb50383067 Date: 2019-10-05 13:49 +0200 http://bitbucket.org/pypy/extradoc/changeset/73fb50383067/
Log: some more text diff --git a/blog/draft/2019-10-json.rst b/blog/draft/2019-10-json.rst --- a/blog/draft/2019-10-json.rst +++ b/blog/draft/2019-10-json.rst @@ -223,4 +223,119 @@ (simplejson), with ujson, with the JSON parser of Node/V8 and with RapidJSON (in DOM mode). -$$$ + +I collected a number of medium-to-large JSON files to try the JSON +parsers on: + +- `Censys <https://censys.io/data>`__: A subset of the Censys port and + protocol scan data for websites in the Alexa top million domains +- `Gharchive <https://www.gharchive.org>`__: Github activity from + January 15-23, 2015 from Github Archive +- `Reddit <https://files.pushshift.io/reddit/comments/>`__: Reddit + comments from May 2009 +- Rosie: The nested matches produced using the `Rosie pattern + language <https://rosie-lang.org/>`__ ``all.things`` pattern on a log + file +- Nytimes: Metadata of a collection of New York Times articles +- Tpch: The TPC-H database benchmark's deals table as a JSON file +- Twitter: A JSON export of the @pypyproject Twitter account data +- Wikidata: A file storing a subset of the Wikidata fact dump from Nov + 11, 2014 +- `Yelp <https://www.yelp.com/dataset/download>`__: A file of yelp + businesses + +Here are the file sizes of the benchmarks: + +.. raw:: html + + <table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th>Benchmark</th>\n <th>File Size [MiB]</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Censys</td>\n <td>898.45</td>\n </tr>\n <tr>\n <td>Gharchive</td>\n <td>276.34</td>\n </tr>\n <tr>\n <td>NYTimes</td>\n <td>12.98</td>\n </tr>\n <tr>\n <td>Reddit</td>\n <td>931.65</td>\n </tr>\n <tr>\n <td>Rosie</td>\n <td>388.88</td>\n </tr>\n <tr>\n <td>TPCH</td>\n <td>173.86</td>\n </tr>\n <tr>\n <td>Wikidata</td>\n <td>119.75</td>\n </tr>\n <tr>\n <td>Yelp</td>\n <td>167.61</td>\n </tr>\n </tbody>\n</table> + + +I measured the times of each benchmark with a number of variations +of the improved PyPy algorithms: + +- PyPyBaseline: The PyPy JSON parser as it was before my work with JSON + parsing started (PyPy version 5.8) +- PyPyKeyStringCaching: Memoizing the key strings of dictionaries, but + not the other strings in a json file, and not using maps to represent + dictionaries (this is the JSON parser that PyPy has been shipping since + version 5.9, in the benchmarks I used 7.1). +- PyPyMapNoCache: Like PyPyKeyStringCaching, but using maps to + represent dictionaries. This includes speculatively parsing the next + key using memcmp, but does not use string caching of non-key strings. +- PyPyFull: Like PyPyMapNoCache but uses a string cache for all + strings, not just keys. This is equivalent to what is in PyPy 7.2 + +In addition to wall clock time of parsing, I also measured the increase +in memory use of each implementation after the input string has been +deserialized, i.e. the size of the in-memory representation of every +JSON file. + + +Contributions of Individual Optimizations +--------------------------------------------- + +Let's first look at the contributions of the individual optimizations to the +overall performance and memory usage. + +$$ two graphs + +All the benchmarks were run 30 times in new processes, all the numbers are +normalized to PyPyFull. + +The biggest individual improvement to both parsing time and memory used comes +from caching just the keys in parsed dictionaries. This is the optimization in +PyPy's JSON parser that has been implemented for a while already. To understand +why this optimization is so useful, let's look at some numbers about each +benchmark, namely the number of total keys across all dictionaries in each +file, as well as the number of unique keys. As we can see, for all benchmarks +the number of unique keys is significantly smaller than the number of keys in +total. + +.. raw:: html + + <table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th>Benchmark</th>\n <th>Number of keys</th>\n <th>Number of unique keys</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Censys</td>\n <td>14\u2009404\u2009234</td>\n <td>163</td>\n </tr>\n <tr>\n <td>Gharchive</td>\n <td>6\u2009637\u2009881</td>\n <td>169</td>\n </tr>\n <tr>\n <td>NYTimes</td>\n <td>417\u2009337</td>\n <td>60</td>\n </tr>\n <tr>\n <td>Reddit</td>\n <td>25\u2009226\u2009397</td>\n <td>21</td>\n </tr>\n <tr>\n <td>Rosie</td>\n <td>28\u2009500\u2009101</td>\n <td>5</td>\n </tr>\n <tr>\n <td>TPCH</td>\n <td>6\u2009700\u2009000</td>\n <td>45</td>\n </tr>\n <tr>\n <td>Wikidata</td>\n <td>6\u2009235\u2009088</td>\n <td>1\u2009602</td>\n </tr>\n <tr>\n <td>Yelp</td>\n <td>5\u2009133\u2009914</td>\n <td>6 1</td>\n </tr>\n </tbody>\n</table> + + +The next big jump in deserialization time and memory comes from introducing +maps to represent deserialized dictionaries. With PyPyMapNoCache +deserialization time goes down because it's much cheaper to walk the tree +of maps and store all deserialized objects into an array of values than to +build hashmaps with the same keys again and again. Memory use goes down +for the same reason: it takes a lot less memory to store the shared +structure of each set of keys in the map, as opposed to repeating it again +and again in every hashmap. + +We can look at some numbers about every benchmark again. The table shows how +many map-based dictionaries are deserialized for every benchmark, and how many +hashmap-backed dictionaries. We see that the number of hashmap-backed +dictionaries is often zero, or at most a small percentage of all dictionaries +in each benchmark. Yelp has the biggest number of hashmap-backed dictionaries. +The reason for this is that the input file contains hashmaps that store +combinations of various features of Yelp businesses, and a lot of these +combinations are totally unique to a business. Therefore the heuristics +determine that it's better to store these using hashmaps. + +.. raw:: html + + <table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th>Benchmark</th>\n <th>Map Dicts</th>\n <th>Regular Dicts</th>\n <th>% Regular Dicts</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Censys</td>\n <td>4\u2009049\u2009235</td>\n <td>1\u2009042</td>\n <td>0.03</td>\n </tr>\n <tr>\n <td>Gharchive</td>\n <td>955\u2009301</td>\n <td>0</td>\n <td>0.00</td>\n </tr>\n <tr>\n <td>NYTimes</td>\n <td>80\u2009393</td>\n <td>0</td>\n <td>0.00</td>\n </tr>\n <tr>\n <td>Reddit</td>\n <td>1\u2009201\u2009257</td>\n <td>0</td>\n <td>0.00</td>\n </tr>\n <tr>\n <td>Rosie</td>\n <td>6\u2009248\u2009966</td>\n <td>0</td>\n <td>0.00</td>\n </tr>\n <tr>\n <td>TPCH</td>\n <td>1\u2009000\u2009000</td>\n <td>0</td>\n <td>0.00</td>\n </tr>\n <tr>\n <td>Wikidata</td>\n <td>1\u200 9923\u2009460</td>\n <td>46\u2009905</td>\n <td>2.38</td>\n </tr>\n <tr>\n <td>Yelp</td>\n <td>443\u2009140</td>\n <td>52\u2009051</td>\n <td>10.51</td>\n </tr>\n </tbody>\n</table> + + +We can also look at numbers about how often the memcmp-based speculative +parsing of the next key of a given map succeeds. Looking at statistics +about each benchmark, we can see that the speculation of what key we +expect next pays off in a significant percentage of cases, between 63% for +Wikidata where the dictionary structures are quite irregular, and 99% for +Reddit, where all the dictionaries have the same set of keys. + +.. raw:: html + + <table border="1" class="dataframe">\n <thead>\n <tr style="text-align: right;">\n <th>Benchmark</th>\n <th>Number of Keys</th>\n <th>Map Transitions</th>\n <th>% Successful Speculation</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <td>Censys</td>\n <td>14\u2009404\u2009234</td>\n <td>14\u2009403\u2009243</td>\n <td>65.79</td>\n </tr>\n <tr>\n <td>Gharchive</td>\n <td>6\u2009637\u2009881</td>\n <td>6\u2009637\u2009881</td>\n <td>86.71</td>\n </tr>\n <tr>\n <td>NYTimes</td>\n <td>417\u2009337</td>\n <td>417\u2009337</td>\n <td>79.85</td>\n </tr>\n <tr>\n <td>Reddit</td>\n <td>25\u2009226\u2009397</td>\n <td>25\u2009226\u2009397</td>\n <td>100.00</td>\n </tr>\n <tr>\n <td>Rosie</td>\n <td>28\u2009500\u2009101</td>\n <td>28\u2009500\u2009101</td>\n <td>90.37</td>\n </tr>\n <tr>\n <td>TPCH</td>\n <td>6\u2009700\u20090 00</td>\n <td>6\u2009700\u2009000</td>\n <td>86.57</td>\n </tr>\n <tr>\n <td>Wikidata</td>\n <td>6\u2009235\u2009088</td>\n <td>5\u2009267\u2009744</td>\n <td>63.68</td>\n </tr>\n <tr>\n <td>Yelp</td>\n <td>5\u2009133\u2009914</td>\n <td>4\u2009593\u2009980</td>\n <td>90.43</td>\n </tr>\n <tr>\n <td>geomean</td>\n <td></td>\n <td></td>\n <td>82.04</td>\n </tr>\n </tbody>\n</table> + +General string caching is the most unclear optimization. On the one hand its +impact on memory usage is quite substantial, leading to a 20% reduction for +Gharchive and Reddit, up to a 2× improvement for Yelp. On the other hand, the +effect on performance is less clear, since it even leads to a slowdown in +Gharchive and Reddit, and generally only a small improvement. Choosing the +right heuristic for when to disable the cache also has somewhat unclear effects +and is definitely a topic worthy of further investigation. _______________________________________________ pypy-commit mailing list [email protected] https://mail.python.org/mailman/listinfo/pypy-commit
