[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Freeman updated SPARK-3995:
----------------------------------
    Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); Anaconda is a popular distribution, so this bug is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000), 5)
foo.takeSample(False, 10)
{code}

This fails with:

{code}
PythonException: Traceback (most recent call last):
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func
    if self.getUniformSample(split) <= self._fraction:
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample
    self.initRandomGenerator(split)
  File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator
    self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}
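
For illustration (this snippet is not part of the Spark code, just a demonstration of the problem): on a 64-bit Python 2 build, {{sys.maxint}} is 2 ** 63 - 1, so the randomly chosen seed almost always falls outside the 32-bit range that {{numpy.random.RandomState}} accepts, and under NumPy 1.9 construction fails:

{code:python}
import sys
import random
import numpy

# On 64-bit Python 2, sys.maxint == 2 ** 63 - 1, so this seed is almost
# certainly larger than the maximum NumPy accepts (4294967295).
seed = random.randint(0, sys.maxint)
print(seed > 2 ** 32 - 1)  # True with overwhelming probability

# Under NumPy 1.9 this raises:
# ValueError: Seed must be between 0 and 4294967295
numpy.random.RandomState(seed)
{code}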

In previous versions of NumPy, a random seed larger than 2 ** 32 - 1 was silently 
truncated to fit in 32 bits. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c), 
and NumPy 1.9 now raises a {{ValueError}} for out-of-range seeds instead. But 
{{random.randint(0, sys.maxint)}} on a 64-bit system almost always yields ints larger 
than 2 ** 32 - 1, which effectively breaks sampling operations in PySpark (unless the 
seed is set manually).
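
As a stopgap, passing an explicit seed within the 32-bit range avoids the failing code path; for example:

{code:python}
# Workaround until a fix lands: supply an explicit seed in [0, 2 ** 32 - 1]
# so the randomly generated out-of-range default is never used.
foo = sc.parallelize(range(1000), 5)
foo.takeSample(False, 10, seed=42)
{code}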

I am putting a PR together now (the fix is very simple!).
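
For reference, here is a sketch of the kind of change I have in mind for {{pyspark/rddsampler.py}} (the actual PR may differ in detail): bound the default seed to the range NumPy accepts.

{code:python}
# Sketch only -- bound the randomly generated default seed to NumPy's
# accepted range [0, 2 ** 32 - 1] instead of [0, sys.maxint].
maxseed = (1 << 32) - 1
self._seed = seed if seed is not None else random.randint(0, maxseed)
{code}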




> [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
> -------------------------------------------------------------
>
>                 Key: SPARK-3995
>                 URL: https://issues.apache.org/jira/browse/SPARK-3995
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 1.1.0
>            Reporter: Jeremy Freeman
>            Priority: Critical
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
