Ben Smith created SPARK-32522:
---------------------------------
Summary: Using pyspark with a MultilayerPerceptron model gives
inconsistent outputs if a large amount of data is fed into it and at least one
of the model outputs is fed to a Python UDF.
Key: SPARK-32522
URL: https://issues.apache.org/jira/browse/SPARK-32522
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.3, 3.1.0
Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3
or
CentOS 7.6 with Python 3.6.3 and Spark built from master
Reporter: Ben Smith
Attachments: model.zip
Using pyspark with a MultilayerPerceptron model gives inconsistent outputs if a
large amount of data is fed into it and at least one of the model outputs is
fed to a Python UDF.
This data correctness issue impacts both the Spark 2.4 releases and the latest
master branch.
I do not understand the root cause and cannot reproduce the problem 100% of
the time, but I have a simplified code sample (attached) that triggers the bug
regularly. I raised an inquiry on the mailing list as a Spark 2.4 issue, but
nobody suggested a root cause. I have since recreated the problem on master,
so I am now raising a bug here.
During debugging I have narrowed the problem down somewhat. My observations so
far are:
* I can recreate the problem with a very simple MultilayerPerceptron with no
hidden layers (just 14 features and 2 outputs). I also see it with a more
complex MultilayerPerceptron model, so I don't think the model details are
important.
* I cannot recreate the problem unless the model output is fed to a Python
UDF. Removing the UDF leads to good model outputs; with the UDF in place, the
model outputs themselves are inconsistent (not just the Python UDF outputs).
* I cannot recreate the problem on minuscule amounts of data or when my data
is partitioned heavily. With 100,000 rows of input in 2 partitions, the issue
happens most of the time.
* Some of the bad outputs I get could be explained if certain features were
zero when they came into the model (they are not zero in my actual feature
data).
* I can recreate the problem on several different servers
My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can also
recreate the issue with code from the Spark master branch, but strangely I
cannot recreate it with Spark 2.4.3 and Python 2.7. I'm not sure why the
version of Python would matter.
The attached code sample triggers the problem for me the vast majority of the
time when pasted into a pyspark shell. The code generates a dataframe
containing 100,000 identical rows, transforms it with a MultilayerPerceptron
model, and feeds one of the model output columns to a simple Python UDF to
generate an additional column. It then selects the distinct rows of the
resulting dataframe. Since all the inputs are identical, I would expect to get
1 row back. Instead I get many unique rows, and the number returned varies each
time I run the code. To run the code you will need the model files locally. I
have attached the model as a zip archive, and unzipping this to /tmp should be
all you need to do to get the code to run.
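For readers without the attachment, the steps described above can be sketched
roughly as follows. This is not the attached sample, only an approximation of
it under stated assumptions: the feature values, the "features" and
"prediction" column names (Spark ML defaults), the helper names build_rows and
run_repro, and the model_path argument are all hypothetical. The pyspark
imports are done inside the function so the row-building helper can be loaded
without Spark.

```python
# Placeholder feature values (assumption): the report only states that the
# simple model takes 14 features, not what the values are.
FEATURES = tuple(float(i) for i in range(1, 15))


def build_rows(n=100000):
    """Return n identical single-column rows, each holding the same features."""
    return [(FEATURES,)] * n


def run_repro(spark, model_path, num_rows=100000, num_partitions=2):
    """Count distinct output rows; 1 is expected since all inputs are identical.

    model_path is wherever the attached model was unzipped (not filled in here).
    """
    # Imported lazily so build_rows() is usable without a Spark installation.
    from pyspark.ml.classification import MultilayerPerceptronClassificationModel
    from pyspark.ml.linalg import Vectors
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    rows = [(Vectors.dense(list(f)),) for (f,) in build_rows(num_rows)]
    # Assumes the model was saved with the default "features" input column.
    df = spark.createDataFrame(rows, ["features"]).repartition(num_partitions)

    model = MultilayerPerceptronClassificationModel.load(model_path)
    scored = model.transform(df)

    # A trivial Python UDF applied to one model output column.
    passthrough = udf(lambda x: float(x), DoubleType())
    out = scored.withColumn("udf_out", passthrough(scored["prediction"]))

    return out.distinct().count()
```

In a pyspark shell this would be called as run_repro(spark, model_path) with
model_path pointing at the unzipped model directory. A return value greater
than 1 would correspond to the behaviour reported here. Raising num_partitions
or lowering num_rows should, per the observations above, make the problem
harder to trigger.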