Ben Smith created SPARK-32522:
---------------------------------

             Summary: Using pyspark with a MultilayerPerceptron model gives 
inconsistent outputs if a large amount of data is fed into it and at least one 
of the model outputs is fed to a Python UDF.
                 Key: SPARK-32522
                 URL: https://issues.apache.org/jira/browse/SPARK-32522
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.3, 3.1.0
         Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3

or

CentOS 7.6 with Python 3.6.3 and Spark built from master
            Reporter: Ben Smith
         Attachments: model.zip

Using pyspark with a MultilayerPerceptron model gives inconsistent outputs if a 
large amount of data is fed into it and at least one of the model outputs is 
fed to a Python UDF.

This data correctness issue affects both the Spark 2.4 releases and the current 
master branch.

I do not understand the root cause and cannot reproduce the problem on every 
run, but I have a simplified code sample (attached) that triggers it regularly. 
I raised an inquiry on the mailing list as a Spark 2.4 issue, but nobody could 
suggest a root cause. I have since reproduced the problem on master, so I am 
now raising a bug here.

During debugging I have narrowed the problem down somewhat. My observations so 
far are:
 * I can recreate the problem with a very simple MultilayerPerceptron with no 
hidden layers (just 14 features and 2 outputs). I also see it with a more 
complex MultilayerPerceptron model, so I don't think the model details are 
important.
 * I cannot recreate the problem unless the model output is fed to a Python 
UDF. Removing the UDF gives good model outputs; adding it makes the model 
outputs themselves inconsistent (not just the Python UDF outputs).
 * I cannot recreate the problem on very small amounts of data or when my data 
is heavily partitioned. With 100,000 rows of input in 2 partitions, the issue 
happens most of the time.
 * Some of the bad outputs I get could be explained if certain features were 
zero when they reached the model, even though they are non-zero in my actual 
feature data.
 * I can recreate the problem on several different servers.
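
One way to test the zeroed-feature hypothesis is to recompute the model output by hand with selected features set to zero and compare it against a bad row. The following is only a minimal sketch in plain Python: it assumes the no-hidden-layer model reduces to softmax(W x + b), and the weights, bias, feature values and zeroed indices are all made up for illustration (the real values would come from the attached model and the observed bad rows).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, bias, features):
    """Output of a perceptron with no hidden layers: softmax(W x + b)."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

def zero_out(features, indices):
    """Copy of `features` with the given positions replaced by 0.0."""
    return [0.0 if i in indices else x for i, x in enumerate(features)]

# Hypothetical parameters: 14 features, 2 outputs (NOT the attached model).
N_FEATURES = 14
weights = [
    [0.01 * (i + 1) for i in range(N_FEATURES)],
    [-0.005 * (i + 1) for i in range(N_FEATURES)],
]
bias = [0.0, 0.5]
features = [1.0] * N_FEATURES

good = forward(weights, bias, features)
# Pretend the last two features arrived as zero.
corrupted = forward(weights, bias, zero_out(features, {12, 13}))

print("expected :", good)
print("corrupted:", corrupted)
```

If a bad row from the real job matches the output for some zeroed subset of features, that would support the idea that the feature vector is being corrupted before it reaches the model.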

My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can also 
recreate the issue with code built from the Spark master branch, but strangely 
I cannot recreate it with Spark 2.4.3 and Python 2.7. I'm not sure why the 
version of Python would matter.

The attached code sample triggers the problem for me the vast majority of the 
time when pasted into a pyspark shell. The code generates a dataframe 
containing 100,000 identical rows, transforms it with a MultilayerPerceptron 
model, and feeds one of the model output columns to a simple Python UDF to 
generate an additional column. It then selects the distinct rows of the 
resulting dataframe. Since all the inputs are identical, I would expect to get 
1 row back. Instead I get many unique rows, and the number returned varies each 
time I run the code. To run the code you will need the model files locally. I 
have attached the model as a zip archive; unzipping it to /tmp should be all 
you need to do to get the code to run.
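
For readers without the attachment, the steps described above can be sketched roughly as follows. This is not the attached code, only an approximation under stated assumptions: the feature values, the UDF body, the helper names, and the use of the default `features` and `prediction` column names are all guesses for illustration.

```python
def make_feature_rows(n_rows, n_features=14, value=1.0):
    """Return n_rows identical feature tuples (plain Python, no Spark needed).

    The feature value is arbitrary; the report only says the rows are identical.
    """
    row = tuple(float(value) for _ in range(n_features))
    return [row] * n_rows

def count_distinct_outputs(spark, model_path, n_rows=100000, n_partitions=2):
    """Score identical rows, pass one output through a Python UDF, and
    return the number of distinct result rows (expected: 1)."""
    # Imported here so make_feature_rows can be used without pyspark installed.
    from pyspark.ml.classification import MultilayerPerceptronClassificationModel
    from pyspark.ml.linalg import Vectors
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    rows = [(Vectors.dense(list(r)),) for r in make_feature_rows(n_rows)]
    # Assumes the saved model reads its input from the default "features" column.
    df = spark.createDataFrame(rows, ["features"]).repartition(n_partitions)

    model = MultilayerPerceptronClassificationModel.load(model_path)
    scored = model.transform(df)

    # A trivial Python UDF applied to one model output column
    # (assumed here to be the default "prediction" column).
    plus_one = udf(lambda p: float(p) + 1.0, DoubleType())
    with_udf = scored.withColumn("udf_out", plus_one(scored["prediction"]))

    return with_udf.distinct().count()
```

In a pyspark shell this would be called as `count_distinct_outputs(spark, model_path)`, with `model_path` pointing at wherever the unzipped model directory ends up under /tmp. A result greater than 1 would correspond to the inconsistency described above.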



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

