Bug#820553: RFP: blockstack-server -- A server that handles the core functionality of building the global Internet database

2016-04-09 Thread Carlo Stemberger
Package: wnpp
Severity: wishlist

* Package name: blockstack-server
  Version : Git
  Upstream Author : Muneeb Ali, Jude Nelson
* URL : https://github.com/blockstack/blockstack-server
* License : GPL
  Programming Lang: Python
  Description : A server that handles the core functionality of building 
the global Internet database

Blockstack server provides decentralized DNS by using an underlying
blockchain. It enables human-readable name registrations on the Bitcoin
blockchain, along with the ability to store associated data in external
datastores. You can use it to register globally unique names, bind data
records with those names, and transfer them between Bitcoin addresses.
Anyone can perform lookups on those names and securely obtain the
associated data records.

Blockstack uses the Bitcoin blockchain for storing name operations and
data hashes, and the Kademlia-based distributed hash table (DHT) and
other external datastores for storing the full data files outside of the
blockchain.



Re: CPython hash randomization makes some Python packages unreproducible

2016-04-09 Thread Julien Cristau
On Sat, Apr  9, 2016 at 13:25:39 -0400, Cara wrote:

> I think a better solution is disabling hash randomization by setting
> PYTHONHASHSEED=0 when building Python packages with CPython for Debian,
> probably somewhere in dh-python.  Note that this isn't necessary for
> PyPy, which doesn't have hash randomization[7].  Hash randomization was
> implemented to prevent, "[H]ash collisions [being] exploited to DoS a
> web framework that automatically parses input forms into
> dictionaries"[8].  This shouldn't be an issue at build-time, as any
> time CPython is run to read in the files written during the build, hash
> randomization will be enabled again.
> 
FWIW I think that's a bad idea.  A number of packages run their test
suite at build time, and running the tests with hash randomization
enabled seems to me like something we shouldn't give up.  Couldn't
packages where the binary packages' contents depend on the hash seed just
set one themselves?

Cheers,
Julien



CPython hash randomization makes some Python packages unreproducible

2016-04-09 Thread Cara
I've been investigating why some Python packages are unreproducible[1]
and have discovered that in some cases the problem can be traced to
CPython's hash randomization.  This happens any time a package writes
files that depend on the iteration order over dictionaries or sets. An
example is python-phply[2], which depends on PLY[3], an LALR parser for
Python.  After being given a grammar, PLY generates LALR parse tables
and writes these tables to a file to avoid needing to regenerate them,
and in generating the file, PLY iterates over dict.items()[4].  This
problem has also occurred in other contexts; for instance, Sphinx had a
reproducibility issue[5] related to hash randomization.  Another
example is pickle: running the following script under CPython will
generate different pickle files with different values of PYTHONHASHSEED
because the order in which a dictionary is created affects its pickle.

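import pickle

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
# Rebuild the dict from its items, then pickle it; the with block
# closes the file so the pickle is flushed to disk.
with open('temp.pickle', 'wb') as f:
    pickle.dump(dict(d.items()), f)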
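
The effect can also be observed without pickle.  The following is only a
minimal sketch, assuming CPython 3; the helper name order_under_seed is
made up for illustration.  It starts a fresh interpreter for each seed
and reports the iteration order of a small set of strings.

```python
import os
import subprocess
import sys

# Child program: print the iteration order of a small set of strings.
CHILD = "print(','.join({'a', 'b', 'c', 'd', 'e'}))"


def order_under_seed(seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", CHILD], env=env)
    return out.decode().strip()


if __name__ == "__main__":
    # Seed 0 appears twice to show that a fixed seed is repeatable.
    for seed in (0, 1, 2, 0):
        print(seed, order_under_seed(seed))
```

Runs with the same seed print the same order; runs with different seeds
may not.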

There's often no simple solution for these problems at the level of the
packages themselves.  In PLY's case, trying to sort the parse tables
before writing them to file doesn't work because of how it iterates
over dictionaries during table generation[6].  I doubt that the other
proposed solution in that GitHub issue, using an ordered dictionary,
will be accepted by David Beazley because it would cause a significant
performance hit on CPython <3.5, particularly CPython 2.7, since a C
implementation of ordered dictionaries was only added in 3.5.  More
broadly, trying to patch every individual Python package that's
affected is impractical, both because of the number of affected
packages and the possibility that any individual patch can be quite
complicated if it's even possible.
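
For contrast, here is roughly what the simple case looks like, where
sorting at write time is enough.  This is a sketch only, not PLY's code:
render_table and the table name are hypothetical, and it assumes the keys
can be compared with each other.

```python
def render_table(name, table):
    """Render a dict as Python source with its keys in sorted order."""
    lines = ["%s = {" % name]
    # sorted() fixes the output order regardless of the hash seed.
    for key, value in sorted(table.items()):
        lines.append("    %r: %r," % (key, value))
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_table("_example_table", {"b": 2, "a": 1}))
```

This helps only when the order of the written output is the sole thing
affected.  It does not cover cases like PLY's, where the iteration order
during table generation already changes what gets computed.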

I think a better solution is disabling hash randomization by setting
PYTHONHASHSEED=0 when building Python packages with CPython for Debian,
probably somewhere in dh-python.  Note that this isn't necessary for
PyPy, which doesn't have hash randomization[7].  Hash randomization was
implemented to prevent, "[H]ash collisions [being] exploited to DoS a
web framework that automatically parses input forms into
dictionaries"[8].  This shouldn't be an issue at build-time, as any
time CPython is run to read in the files written during the build, hash
randomization will be enabled again.
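
As a rough illustration of the idea, and not of how dh-python works or
would implement it, a build step could be run with the seed pinned in
its environment.  The function name run_with_fixed_hash_seed and the
example command are assumptions made for this sketch.

```python
import os
import subprocess
import sys


def run_with_fixed_hash_seed(argv, seed=0):
    """Run a build command with PYTHONHASHSEED pinned; return its exit status."""
    env = dict(os.environ)
    # A seed of 0 disables hash randomization in the child interpreter.
    env["PYTHONHASHSEED"] = str(seed)
    return subprocess.call(argv, env=env)


if __name__ == "__main__":
    # Hypothetical build command; replace with the package's real build step.
    status = run_with_fixed_hash_seed(
        [sys.executable, "-c", "import os; print(os.environ['PYTHONHASHSEED'])"]
    )
    sys.exit(status)
```

Only the child process sees the pinned seed; interpreters started later
without the variable set use randomized hashes as usual.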

Ceridwen

[1] https://wiki.debian.org/ReproducibleBuilds
[2] https://packages.debian.org/stretch/python/python-phply
[3] https://github.com/dabeaz/ply
[4] https://github.com/dabeaz/ply/blob/master/ply/yacc.py#L2733
[5] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=795976;msg=29
[6] https://github.com/dabeaz/ply/issues/79
[7] http://doc.pypy.org/en/latest/cpython_differences.html#miscellaneous
[8] https://bugs.python.org/issue13703