document number encoding
Project: http://git-wip-us.apache.org/repos/asf/couchdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/couchdb/commit/bbd93f77
Tree: http://git-wip-us.apache.org/repos/asf/couchdb/tree/bbd93f77
Diff: http://git-wip-us.apache.org/repos/asf/couchdb/diff/bbd93f77
Branch: refs/heads/master
Commit: bbd93f77baa4bfe1022b4fb9c9a66bdcaf9e17db
Parents: eb7d91f
Author: Jan Lehnardt <[email protected]>
Authored: Wed Feb 20 16:49:31 2013 +0100
Committer: Jan Lehnardt <[email protected]>
Committed: Wed Feb 20 17:29:35 2013 +0100
----------------------------------------------------------------------
 share/doc/src/json-structure.rst | 187 +++++++++++++++++++++++++++++++++
 1 files changed, 187 insertions(+), 0 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/couchdb/blob/bbd93f77/share/doc/src/json-structure.rst
----------------------------------------------------------------------
diff --git a/share/doc/src/json-structure.rst b/share/doc/src/json-structure.rst
index aaba09e..bc4d0d2 100644
--- a/share/doc/src/json-structure.rst
+++ b/share/doc/src/json-structure.rst
@@ -656,3 +656,190 @@ View Head Information
         "total_rows": 42,
         "offset": 3
     }
+
+Number Handling
+===============
+
+Any numbers defined in JSON that contain a decimal point or exponent
+will be passed through the Erlang VM's idea of the "double" data type.
+Any numbers that are used in views will pass through the view's idea of
+a number (the common JavaScript case means even integers pass through
+a double due to JavaScript's definition of a number).
+
+Consider this document that we write to CouchDB:
+
+.. code-block:: javascript
+
+    {
+        "_id":"30b3b38cdbd9e3a587de9b8122000cff",
+        "number": 1.1
+    }
+
+Now let's read that document back from CouchDB:
+
+.. code-block:: javascript
+
+    {
+        "_id":"30b3b38cdbd9e3a587de9b8122000cff",
+        "_rev":"1-f065cee7c3fd93aa50f6c97acde93030",
+        "number":1.1000000000000000888
+    }
+
+
+What happens is CouchDB is changing the textual representation of the
+result of decoding what it was given into some numerical format. In most
+cases this is an `IEEE 754`_ double precision floating point number which
+is exactly what almost all other languages use as well.
+
+.. _IEEE 754: https://en.wikipedia.org/wiki/IEEE_754-2008
+
+What CouchDB does a bit differently than other languages is that it
+does not attempt to pretty print the resulting output to use the
+shortest number of characters. For instance, this is why we have this
+relationship:
+
+.. code-block:: erlang
+
+    ejson:encode(ejson:decode(<<"1.1">>)).
+    <<"1.1000000000000000888">>
+
+What can be confusing here is that internally those two formats
+decode into the same IEEE-754 representation. And more importantly, it
+will decode into a fairly close representation when passed through all
+major parsers that I know about.
+
+While we've only been discussing cases where the textual
+representation changes, another important case is when an input value
+contains more precision than can actually be represented in a double.
+(You could argue that this case is actually "losing" data if you don't
+accept that numbers are stored in doubles).
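The claim that both texts decode into the same IEEE-754 representation can be checked outside CouchDB. The following is a minimal sketch, assuming Python 3 and its standard ``json`` and ``struct`` modules; the helper ``double_bits`` is a name made up for this illustration and is not part of CouchDB or ejson.

```python
import json
import struct


def double_bits(value: float) -> str:
    """Return the 64-bit IEEE 754 pattern of a float as 16 hex digits."""
    return struct.pack(">d", value).hex()


# The text we wrote and the text CouchDB handed back.
written = json.loads("1.1")
read_back = json.loads("1.1000000000000000888")

# Both texts select the same double: the rewrite changed the text, not the value.
same_double = double_bits(written) == double_bits(read_back)

# Input with more digits than a double can hold is rounded when it is parsed.
long_input = "1.01234567890123456789012345678901234567890"
reencoded = json.dumps(json.loads(long_input))

print(same_double)
print(len(long_input), len(reencoded))
```

Comparing bit patterns rather than printed text keeps the check independent of how any particular library chooses to format a double.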
+
+Here's a log for a couple of the more common JSON libraries I happen
+to have on my machine:
+
+Spidermonkey::
+
+    $ js -h 2>&1 | head -n 1
+    JavaScript-C 1.8.5 2011-03-31
+    $ js
+    js> JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    "1.0123456789012346"
+    js> var f = JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    js> JSON.stringify(JSON.parse(f))
+    "1.0123456789012346"
+
+Node::
+
+    $ node -v
+    v0.6.15
+    $ node
+    JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    '1.0123456789012346'
+    var f = JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    undefined
+    JSON.stringify(JSON.parse(f))
+    '1.0123456789012346'
+
+Python::
+
+    $ python
+    Python 2.7.2 (default, Jun 20 2012, 16:23:33)
+    [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
+    Type "help", "copyright", "credits" or "license" for more information.
+    import json
+    json.dumps(json.loads("1.01234567890123456789012345678901234567890"))
+    '1.0123456789012346'
+    f = json.dumps(json.loads("1.01234567890123456789012345678901234567890"))
+    json.dumps(json.loads(f))
+    '1.0123456789012346'
+
+Ruby::
+
+    $ irb --version
+    irb 0.9.5(05/04/13)
+    require 'JSON'
+    => true
+    JSON.dump(JSON.load("[1.01234567890123456789012345678901234567890]"))
+    => "[1.01234567890123]"
+    f = JSON.dump(JSON.load("[1.01234567890123456789012345678901234567890]"))
+    => "[1.01234567890123]"
+    JSON.dump(JSON.load(f))
+    => "[1.01234567890123]"
+
+
+.. note:: A small aside on Ruby, it requires a top level object or array, so I just
+          wrapped the value. Should be obvious it doesn't affect the result of
+          parsing the number though.
+
+
+Ejson (CouchDB's current parser) at CouchDB sha 168a663b::
+
+    $ ./utils/run -i
+    Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2]
+    [async-threads:4] [hipe] [kernel-poll:true]
+
+    Eshell V5.8.5 (abort with ^G)
+    1> ejson:encode(ejson:decode(<<"1.01234567890123456789012345678901234567890">>)).
+    <<"1.0123456789012346135">>
+    2> F = ejson:encode(ejson:decode(<<"1.01234567890123456789012345678901234567890">>)).
+    <<"1.0123456789012346135">>
+    3> ejson:encode(ejson:decode(F)).
+    <<"1.0123456789012346135">>
+
+
+As you can see they all pretty much behave the same, except that Ruby
+does appear to be losing some precision compared with the other
+libraries.
+
+The astute observer will notice that ejson (the CouchDB JSON library)
+reported an extra three digits. While it's tempting to think that this
+is due to some internal difference, it's just a more specific case of
+the 1.1 input as described above.
+
+The important point to realize here is that a double can only hold a
+finite number of values. What we're doing here is generating a string
+that when passed through the "standard" floating point parsing
+algorithms (i.e., strtod) will result in the same bit pattern in memory
+as we started with. Or, put slightly differently, the bytes in a JSON
+serialized number are chosen such that they refer to a single specific
+value that a double can represent.
+
+The important point to understand is that we're mapping from one
+infinite set onto a finite set. An easy way to see this is by
+reflecting on this::
+
+    1.0 == 1.00 == 1.000 == 1.(infinite zeroes)
+
+Obviously a computer can't hold infinite bytes so we have to
+decimate our infinitely sized set to a finite set that can be
+represented concisely.
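The mapping from an infinite set of texts onto a finite set of doubles can be made concrete. This is a rough sketch, assuming Python 3; ``next_double_up`` is a hypothetical helper written for this note, and it only handles positive finite values.

```python
import struct
from fractions import Fraction


def next_double_up(value: float) -> float:
    """Return the next representable double above a positive finite value."""
    (bits,) = struct.unpack(">q", struct.pack(">d", value))
    return struct.unpack(">d", struct.pack(">q", bits + 1))[0]


value = float("1.1")
neighbour = next_double_up(value)

# Fraction converts a float exactly, so this is the true gap between the two.
gap = Fraction(neighbour) - Fraction(value)

# Every one of these texts lands on the same double as "1.1".
texts = ["1.1", "1.10", "1.100", "1.1000000000000000888"]
all_same = all(float(text) == value for text in texts)

print(gap == Fraction(1, 2**52))
print(all_same)
```

Nothing representable sits between ``value`` and ``neighbour``, so any decimal text that falls close enough to one of them has to collapse onto it.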
+
+The game that other JSON libraries are playing is merely:
+
+"How few characters do I have to use to select this specific value for a double"
+
+And that game has lots and lots of subtle details that are difficult
+to duplicate in C without a significant amount of effort (it took
+Python over a year to get it sorted with their fancy build systems
+that automatically run on a number of different architectures).
+
+Hopefully we've shown that CouchDB is not doing anything "funky" by
+changing input. It's behaving the same as any other common JSON library
+does, it's just not pretty printing its output.
+
+On the other hand, if you actually are in a position where an IEEE-754
+double is not a satisfactory datatype for your numbers, then the
+answer, as has been stated, is to not pass your numbers through this
+representation. In JSON this is accomplished by encoding them as a
+string or by using integer types (although integer types can still
+bite you if you use a platform that has a different integer
+representation than normal, e.g., JavaScript).
+
+Also, if anyone is really interested in changing this behavior, I'm
+all ears for contributions to `jiffy`_ (which is theoretically going to
+replace ejson when I get around to updating the build system). The
+places I've looked for inspiration are TCL and Python. If you know a
+decent implementation of this float printing algorithm give me a
+holler.
+
+.. _jiffy: https://github.com/davisp/jiffy
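The "how few characters" game described above can be approximated by brute force. The sketch below assumes Python 3 and is not the algorithm used by Python, TCL, jiffy, or any other library mentioned here; ``shortest_round_trip`` is a made-up name. It simply tries more and more significant digits until the text parses back to the same double, and in rare edge cases it may settle on one digit more than a real shortest-output algorithm would.

```python
def shortest_round_trip(value: float) -> str:
    """Return the shortest %g-style text found that parses back to value."""
    for digits in range(1, 18):
        text = format(value, ".%dg" % digits)
        if float(text) == value:
            return text
    # Not reached for finite doubles: 17 significant digits always round-trip.
    return repr(value)


# The two long texts that ejson produced in the logs above.
print(shortest_round_trip(float("1.1000000000000000888")))
print(shortest_round_trip(float("1.0123456789012346135")))
```

For the two ejson outputs this should print ``1.1`` and ``1.0123456789012346``, the same short forms the other libraries in the logs report. The loop hides the hard part: it leans on the platform's correctly rounded formatting and parsing, which is exactly the machinery that is tedious to reproduce by hand in C.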
