[ https://issues.apache.org/jira/browse/SPARK-32522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Smith updated SPARK-32522:
------------------------------
Attachment: model.zip
> Using pyspark with a MultiLayerPerceptron model gives inconsistent outputs if
> a large amount of data is fed into it and at least one of the model outputs
> is fed to a Python UDF.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-32522
> URL: https://issues.apache.org/jira/browse/SPARK-32522
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.3, 3.1.0
> Environment: CentOS 7.6 with Python 3.6.3 and Spark 2.4.3
> or
> CentOS 7.6 with Python 3.6.3 and Spark built from master
> Reporter: Ben Smith
> Priority: Major
> Labels: correctness
> Attachments: model.zip
>
>
> Using pyspark with a MultiLayerPerceptron model gives inconsistent outputs if
> a large amount of data is fed into it and at least one of the model outputs
> is fed to a Python UDF.
> This data correctness issue impacts both the Spark 2.4 releases and the
> latest master branch.
> I do not understand the root cause and cannot reproduce the problem 100% of
> the time, but I have a simplified code sample (attached) that triggers the
> bug regularly. I raised an inquiry on the mailing list as a Spark 2.4 issue,
> but nobody suggested a root cause. I have since reproduced the problem on
> master, so I am now raising a bug here.
> During debugging I have narrowed the problem down somewhat and some
> observations I have made while doing this are:
> * I can reproduce the problem with a very simple MultilayerPerceptron with no
> hidden layers (just 14 features and 2 outputs). I also see it with a more
> complex MultilayerPerceptron model, so I don't think the model details are
> important.
> * I cannot reproduce the problem unless the model output is fed to a Python
> UDF. Without the UDF the model outputs are good; with it the model outputs
> themselves become inconsistent (not just the Python UDF outputs).
> * I cannot reproduce the problem on very small amounts of data or when my
> data is heavily partitioned. With 100,000 rows of input in 2 partitions the
> issue happens most of the time.
> * Some of the bad outputs I get could be explained if certain features were
> zero when they reached the model (they are not zero in my actual feature
> data).
> * I can reproduce the problem on several different servers.
> My environment is CentOS 7.6 with Python 3.6.3 and Spark 2.4.3. I can also
> reproduce the issue with Spark built from the master branch but, strangely,
> I cannot reproduce it with Spark 2.4.3 and Python 2.7. I'm not sure why the
> version of Python would matter.
> The attached code sample triggers the problem for me the vast majority of the
> time when pasted into a pyspark shell. The code generates a dataframe
> containing 100,000 identical rows, transforms it with a MultiLayerPerceptron
> model, and feeds one of the model output columns to a simple Python UDF to
> generate an additional column. It then selects the distinct rows of the
> resulting dataframe. Since all the inputs are identical I would expect to
> get 1 row back. Instead I get many unique rows, and the number returned
> varies each time I run the code. To run the code you will need the model
> files locally. I have attached the model as a zip archive; unzipping it to
> /tmp should be all you need to do to get the code to run.
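
The steps described above can be sketched roughly as follows. This is not the reporter's attached sample, only a minimal illustration of the same shape under several assumptions: the helper names (make_rows, count_distinct_outputs), the constant feature value 1.0, the input column name "features", the output column name "prediction" and the trivial "plus one" UDF are all hypothetical, and model_path must point at wherever the unzipped model actually lives, since the report does not give the directory name.

```python
N_FEATURES = 14    # from the report: the simple model has 14 input features
N_ROWS = 100000    # from the report: 100,000 identical input rows
N_PARTITIONS = 2   # from the report: 2 partitions triggers the issue most often


def make_rows(n_rows=N_ROWS, n_features=N_FEATURES, value=1.0):
    """Return n_rows identical feature lists (placeholder values, not the real data)."""
    return [[value] * n_features for _ in range(n_rows)]


def count_distinct_outputs(spark, model_path, n_rows=N_ROWS, n_partitions=N_PARTITIONS):
    """Score identical rows, add a Python UDF column, and count the distinct result rows."""
    # PySpark is imported here so that make_rows stays usable without a Spark install.
    from pyspark.ml.classification import MultilayerPerceptronClassificationModel
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # Assumption: the saved model reads its input from a column named "features".
    rows = [(Vectors.dense(r),) for r in make_rows(n_rows)]
    df = spark.createDataFrame(rows, ["features"]).repartition(n_partitions)

    model = MultilayerPerceptronClassificationModel.load(model_path)
    scored = model.transform(df)

    # Assumption: the model writes to the default "prediction" column.
    # The UDF body is arbitrary; the report only says a simple Python UDF is applied.
    plus_one = F.udf(lambda p: float(p) + 1.0, DoubleType())
    with_udf = scored.withColumn("udf_out", plus_one(F.col("prediction")))

    # All inputs are identical, so a correct run should yield exactly one distinct row.
    return with_udf.distinct().count()
```

In a pyspark shell, calling count_distinct_outputs(spark, model_path) with the path of the unzipped model would be expected to return 1 on a correct run; according to the report, affected setups return a larger number that varies between runs.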