[1/2] git commit: document number encoding

jan Wed, 20 Feb 2013 08:31:36 -0800

document number encoding


Project: http://git-wip-us.apache.org/repos/asf/couchdb/repo
Commit: http://git-wip-us.apache.org/repos/asf/couchdb/commit/bbd93f77
Tree: http://git-wip-us.apache.org/repos/asf/couchdb/tree/bbd93f77
Diff: http://git-wip-us.apache.org/repos/asf/couchdb/diff/bbd93f77

Branch: refs/heads/master
Commit: bbd93f77baa4bfe1022b4fb9c9a66bdcaf9e17db
Parents: eb7d91f
Author: Jan Lehnardt <[email protected]>
Authored: Wed Feb 20 16:49:31 2013 +0100
Committer: Jan Lehnardt <[email protected]>
Committed: Wed Feb 20 17:29:35 2013 +0100

----------------------------------------------------------------------
 share/doc/src/json-structure.rst |  187 +++++++++++++++++++++++++++++++++
 1 files changed, 187 insertions(+), 0 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/couchdb/blob/bbd93f77/share/doc/src/json-structure.rst
----------------------------------------------------------------------
diff --git a/share/doc/src/json-structure.rst b/share/doc/src/json-structure.rst
index aaba09e..bc4d0d2 100644
--- a/share/doc/src/json-structure.rst
+++ b/share/doc/src/json-structure.rst
@@ -656,3 +656,190 @@ View Head Information
         "total_rows": 42,
         "offset": 3
     }
+
+Number Handling
+===============
+
+Any number in JSON that contains a decimal point or exponent is passed
+through the Erlang VM's idea of the "double" data type. Any number that
+is used in a view passes through the view server's idea of a number (in
+the common JavaScript case, even integers pass through a double, because
+every JavaScript number is a double).
+
+Consider this document that we write to CouchDB:
+
+.. code-block:: javascript
+
+    {
+      "_id":"30b3b38cdbd9e3a587de9b8122000cff",
+      "number": 1.1
+    }
+
+Now let's read that document back from CouchDB:
+
+.. code-block:: javascript
+
+    {
+      "_id":"30b3b38cdbd9e3a587de9b8122000cff",
+      "_rev":"1-f065cee7c3fd93aa50f6c97acde93030",
+      "number":1.1000000000000000888
+    }
+
+
+What happens is that CouchDB decodes the number it was given into a
+numerical format and then encodes that value with a different textual
+representation. In most cases the numerical format is an `IEEE 754`_
+double precision floating point number, which is exactly what almost all
+other languages use as well.
+
+.. _IEEE 754: https://en.wikipedia.org/wiki/IEEE_754-2008
+
+What CouchDB does a bit differently than other languages is that it
+does not attempt to pretty print the resulting output to use the
+shortest number of characters. For instance, this is why we have this
+relationship:
+
+.. code-block:: erlang
+
+    ejson:encode(ejson:decode(<<"1.1">>)).
+    <<"1.1000000000000000888">>
+
+What can be confusing here is that internally both strings decode
+into the same IEEE-754 representation. More importantly, each of them
+decodes into a fairly close representation when passed through all
+major parsers that I know about.
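+
+As a rough illustration (a Python sketch, not CouchDB code; the variable
+names are made up for this example), both strings select the same
+double, which can be checked by comparing the raw IEEE-754 bytes:
+
```python
import struct

short_text = "1.1"
long_text = "1.1000000000000000888"

# Parse both textual forms into doubles.
short_value = float(short_text)
long_value = float(long_text)

# Compare the raw 8-byte big-endian IEEE-754 encodings.
short_bits = struct.pack(">d", short_value)
long_bits = struct.pack(">d", long_value)
same_bits = short_bits == long_bits

# Asking for 20 significant digits reproduces the longer text.
twenty_digits = format(short_value, ".20g")

print(same_bits, twenty_digits)
```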
+
+While we've only been discussing cases where the textual
+representation changes, another important case is when an input value
+contains more precision than can actually be represented in a double.
+(You could argue that this case is actually "losing" data if you don't
+accept that numbers are stored in doubles.)
+
+Here's a log for a couple of the more common JSON libraries I happen
+to have on my machine:
+
+SpiderMonkey::
+
+    $ js -h 2>&1 | head -n 1
+    JavaScript-C 1.8.5 2011-03-31
+    $ js
+    js> JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    "1.0123456789012346"
+    js> var f = JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    js> JSON.stringify(JSON.parse(f))
+    "1.0123456789012346"
+
+Node::
+
+    $ node -v
+    v0.6.15
+    $ node
+    JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    '1.0123456789012346'
+    var f = JSON.stringify(JSON.parse("1.01234567890123456789012345678901234567890"))
+    undefined
+    JSON.stringify(JSON.parse(f))
+    '1.0123456789012346'
+
+Python::
+
+    $ python
+    Python 2.7.2 (default, Jun 20 2012, 16:23:33)
+    [GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
+    Type "help", "copyright", "credits" or "license" for more information.
+    import json
+    json.dumps(json.loads("1.01234567890123456789012345678901234567890"))
+    '1.0123456789012346'
+    f = json.dumps(json.loads("1.01234567890123456789012345678901234567890"))
+    json.dumps(json.loads(f))
+    '1.0123456789012346'
+
+Ruby::
+
+    $ irb --version
+    irb 0.9.5(05/04/13)
+    require 'JSON'
+    => true
+    JSON.dump(JSON.load("[1.01234567890123456789012345678901234567890]"))
+    => "[1.01234567890123]"
+    f = JSON.dump(JSON.load("[1.01234567890123456789012345678901234567890]"))
+    => "[1.01234567890123]"
+    JSON.dump(JSON.load(f))
+    => "[1.01234567890123]"
+
+
+.. note:: A small aside on Ruby: it requires a top-level object or array, so
+         I just wrapped the value in an array. This doesn't affect the
+         result of parsing the number, though.
+
+
+Ejson (CouchDB's current parser) at CouchDB sha 168a663b::
+
+    $ ./utils/run -i
+    Erlang R14B04 (erts-5.8.5) [source] [64-bit] [smp:2:2] [rq:2]
+    [async-threads:4] [hipe] [kernel-poll:true]
+
+    Eshell V5.8.5  (abort with ^G)
+    1> ejson:encode(ejson:decode(<<"1.01234567890123456789012345678901234567890">>)).
+    <<"1.0123456789012346135">>
+    2> F = ejson:encode(ejson:decode(<<"1.01234567890123456789012345678901234567890">>)).
+    <<"1.0123456789012346135">>
+    3> ejson:encode(ejson:decode(F)).
+    <<"1.0123456789012346135">>
+
+
+As you can see, they all behave pretty much the same, except that Ruby
+does appear to lose some precision compared to the other
+libraries.
+
+The astute observer will notice that ejson (the CouchDB JSON library)
+reported an extra three digits. While it's tempting to think that this
+is due to some internal difference, it's just a more specific case of
+the 1.1 input as described above.
+
+The important point to realize here is that a double can only hold a
+finite number of values. What we're doing here is generating a string
+that, when passed through the "standard" floating point parsing
+algorithms (i.e., strtod), will result in the same bit pattern in memory
+as we started with. Or, put slightly differently, the bytes in a JSON
+serialized number are chosen such that they refer to a single specific
+value that a double can represent.
+
+The underlying reason is that we're mapping from an infinite set of
+decimal strings onto a finite set of doubles. An easy way to see this is
+by reflecting on this::
+
+    1.0 == 1.00 == 1.000 == 1.(infinite zeroes)
+
+Obviously a computer can't hold infinite bytes so we have to
+decimate our infinitely sized set to a finite set that can be
+represented concisely.
+
+The game that other JSON libraries are playing is merely:
+
+"How few characters do I have to use to select this specific value for a double"
+
+And that game has lots and lots of subtle details that are difficult
+to duplicate in C without a significant amount of effort (it took
+Python over a year to get it sorted with their fancy build systems
+that automatically run on a number of different architectures).
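+
+To make the "game" concrete, here is a naive Python sketch. It is not
+the algorithm that Python, Tcl, or any JSON library actually uses, and
+the helper name ``shortest_roundtrip`` is hypothetical. It leans on the
+platform's own float formatting and simply tries more digits until the
+text parses back to the same double:
+
```python
def shortest_roundtrip(value: float) -> str:
    """Return the shortest "%g"-style text that parses back to value."""
    # 17 significant digits are always enough to identify a double.
    for digits in range(1, 18):
        candidate = format(value, ".{}g".format(digits))
        if float(candidate) == value:
            return candidate
    # Only reached when nothing compares equal to value, e.g. NaN.
    raise ValueError("value does not round-trip through text")


parsed = float("1.01234567890123456789012345678901234567890")

print(shortest_roundtrip(1.1))     # few digits are enough here
print(shortest_roundtrip(parsed))  # needs many more digits
print(format(parsed, ".20g"))      # fixed 20 digits, no shortening
```
+
+The loop hides the hard part: it relies on ``format`` and ``float``
+already being correctly rounded, which is exactly the detail that is
+difficult to get right in portable C.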
+
+Hopefully we've shown that CouchDB is not doing anything "funky" by
+changing input. It's behaving the same as any other common JSON library
+does; it's just not pretty printing its output.
+
+On the other hand, if you actually are in a position where an IEEE-754
+double is not a satisfactory datatype for your numbers, then the
+answer, as has been stated, is to not pass your numbers through this
+representation. In JSON this is accomplished by encoding them as a
+string or by using integer types (although integer types can still
+bite you if you use a platform that has a different integer
+representation than normal, i.e., JavaScript).
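+
+A minimal client-side sketch of the string approach, in Python. The
+field names ``as_number`` and ``as_string`` are hypothetical, and this
+only shows what a client can do with the JSON text, not anything
+CouchDB does internally:
+
```python
import json
from decimal import Decimal

text = ('{"as_number": 1.01234567890123456789, '
        '"as_string": "1.01234567890123456789"}')

# Default parsing turns the bare number into a double and drops digits.
doc = json.loads(text)
number_is_float = isinstance(doc["as_number"], float)

# The string survives untouched and can be converted exactly.
exact = Decimal(doc["as_string"])

# json.loads can also be told to parse bare numbers as Decimal.
doc_decimal = json.loads(text, parse_float=Decimal)

# Integers can bite too: above 2**53 a double cannot tell neighbours apart.
collapses = float(2**53 + 1) == float(2**53)

print(doc["as_number"], exact, doc_decimal["as_number"], collapses)
```
+
+Note that ``parse_float`` only helps on the reading side; once the value
+has passed through a parser that uses doubles, the extra digits are gone.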
+
+Also, if anyone is really interested in changing this behavior, I'm
+all ears for contributions to `jiffy`_ (which is theoretically going to
+replace ejson when I get around to updating the build system). The
+places I've looked for inspiration are TCL and Python. If you know a
+decent implementation of this float printing algorithm give me a
+holler.
+
+.. _jiffy: https://github.com/davisp/jiffy
