Github user linbojin commented on the pull request:
https://github.com/apache/spark/pull/13087#issuecomment-220837854
@dongjoon-hyun @cloud-fan @marmbrus Thanks for your discussions about my
reported issue: [SPARK-15282](https://issues.apache.org/jira/browse/SPARK-15282)
Maybe I should describe more about our use cases. In our project, firstly
we generated a dataframe with one column called "fileName" one column called
"url", and then we use a udf function (used inside withColumn()) to download
the files from the corresponding urls and filter out '{}' data before writing
to hdfs:
```scala
// df: DataFrame["fileName", "url"]
val getDataUDF = udf((url: String) => {
try {
download data
} catch { case e: Exception =>
"{}"
}
})
val df2 = df.withColumn("data", getDataUDF(df("url")))
.filter("data <> '{}'")
df2.write.save("hdfs path")
```
Based on our logs, each file will be downloaded twice. As for the running
time, the writing job with filter will be twice as without filter:


Left is with `.filter("data <> '{}'")` and right is without `.filter("data
<> '{}'")`. It can be imaged, when there are many urls or the files are very
large, the reported issue will affect the performance a lot.
Another problem is about data correctness. Because it's downloaded twice
for each file, we came across some cases that the first downloading
(getDataUDF) can get data (not '{}'), and the second downloading return '{}'
because of certain connection exception. But i found the filter only worked on
the first returned value so that spark will not remove this row but the value
inside "data" column was '{}' which is the second returned value. Even after
filter, we get the result dataframe df2 like the follows (files with '{}' data
which should be removed):
```
fileName url data
file1 url1 sth
files url2 `{}`
```
**So on the high level, we get '{}' data after filter out '{}' which is
strange. The reason I think is that UDF function is executed twice when filter
on new column created by withColumn, and two returned values are different:
first one makes filter condition true and second one makes filter condition
false. The dataframe will keep the second value which in fact should not appear
after filter operation.**
Finally, i removed the filter operation (filter out '{}' in downstream)
because i think it may be not correct to filter on new column created by
withColumn. For me, i agree with @cloud-fan and @thunterdb, we can just
document this behavior of udfs and uses should avoid to use udfs in such way.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]