[GitHub] spark pull request: [SPARK-15282][SQL] PushDownPredicate should no...

linbojin Sun, 22 May 2016 08:19:34 -0700

Github user linbojin commented on the pull request:

    https://github.com/apache/spark/pull/13087#issuecomment-220837854
  
    @dongjoon-hyun @cloud-fan @marmbrus Thanks for discussing the issue I 
reported: [SPARK-15282](https://issues.apache.org/jira/browse/SPARK-15282)
    
    Let me describe our use case in more detail. In our project, we first 
generate a DataFrame with two columns, "fileName" and "url". We then use a UDF 
(called inside withColumn()) to download the file from each URL, and filter out 
rows whose data is '{}' before writing to HDFS:
    
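    ```scala
    import org.apache.spark.sql.functions.udf
    import scala.io.Source

    // df: DataFrame["fileName", "url"]
    // Download the content behind each URL; return "{}" if anything fails.
    val getDataUDF = udf((url: String) => {
      try {
        // Placeholder for our real download logic ("download data").
        val src = Source.fromURL(url)
        try src.mkString finally src.close()
      } catch { case e: Exception =>
        "{}"
      }
    })
    // Add the downloaded content as a new column, then drop failed downloads.
    val df2 = df.withColumn("data", getDataUDF(df("url")))
                .filter("data <> '{}'")
    df2.write.save("hdfs path")
    ```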
    
    Based on our logs, each file is downloaded twice. As for running time, the 
write job with the filter takes twice as long as the job without it: 
    ![screen shot 2016-05-22 at 22 19 24](https://cloud.githubusercontent.com/assets/5894707/15454461/c13c9918-206b-11e6-8901-1f473fbae3ca.png)
    ![screen shot 2016-05-22 at 22 18 02](https://cloud.githubusercontent.com/assets/5894707/15454462/c8d4c1a0-206b-11e6-88cc-8ad91d121b6e.png)
    The left screenshot is with `.filter("data <> '{}'")` and the right one is 
without it. As you can imagine, when there are many URLs or the files are very 
large, this issue hurts performance significantly.
    
    The other problem is data correctness. Because each file is downloaded 
twice, we hit cases where the first download (getDataUDF) returns real data 
(not '{}') and the second download returns '{}' because of a connection 
exception. I found that the filter is applied only to the first returned value, 
so Spark keeps the row, but the value stored in the "data" column is '{}', 
which is the second returned value. So even after the filter, the resulting 
DataFrame df2 looks like the following (the row with '{}' data should have been 
removed):
    ```
    fileName     url          data
    file1        url1         sth
    file2        url2         {}
    ```
    
    **So at a high level, we get '{}' data after filtering out '{}', which is 
strange. I think the reason is that the UDF is executed twice when we filter on 
a new column created by withColumn, and the two returned values differ: the 
first one makes the filter condition true and the second one would make it 
false. The DataFrame keeps the second value, which should not appear after the 
filter operation.** 
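
    A minimal plain-Scala sketch of this effect follows. It is not Spark code 
and does not reproduce Spark's actual plan; it only simulates a function that 
is evaluated once for the filter and once more for the output column. The names 
`DoubleEvalDemo`, `FlakyDownloader`, `filterThenProject` and 
`projectThenFilter` are hypothetical and exist only for this illustration:

    ```scala
    object DoubleEvalDemo {
      final case class Row(fileName: String, url: String, data: String)

      // Hypothetical flaky downloader: for each URL the first call succeeds,
      // every later call "fails" and returns the "{}" placeholder.
      final class FlakyDownloader {
        private val calls =
          scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)

        def fetch(url: String): String = {
          calls(url) += 1
          if (calls(url) == 1) s"payload-of-$url" else "{}"
        }

        def callCount(url: String): Int = calls(url)
      }

      // Simulates the behavior described above: the predicate evaluates the
      // function, and the projection evaluates it again.
      def filterThenProject(inputs: Seq[(String, String)], d: FlakyDownloader): Seq[Row] =
        inputs
          .filter { case (_, url) => d.fetch(url) != "{}" }          // 1st evaluation
          .map { case (name, url) => Row(name, url, d.fetch(url)) }  // 2nd evaluation

      // Evaluates the function once per row and filters on the stored value.
      def projectThenFilter(inputs: Seq[(String, String)], d: FlakyDownloader): Seq[Row] =
        inputs
          .map { case (name, url) => Row(name, url, d.fetch(url)) }
          .filter(_.data != "{}")
    }
    ```

    With this downloader, `filterThenProject` keeps every row but stores '{}' 
in each of them, while `projectThenFilter` calls the function once per URL and 
only keeps rows whose stored value is not '{}'.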
    
    Finally, I removed the filter operation (we now filter out '{}' 
downstream) because I think it may not be correct to filter on a new column 
created by withColumn. I agree with @cloud-fan and @thunterdb: we can just 
document this behavior of UDFs, and users should avoid using UDFs this way.
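
    One possible way to avoid the pattern, sketched here under assumptions and 
not tested against Spark, is to call the download function explicitly once per 
row and decide in the same step whether to keep the row. The helper 
`SinglePassDownload.downloadRow` and its `fetch` parameter are hypothetical 
names, not part of our project or of Spark:

    ```scala
    object SinglePassDownload {
      // Calls `fetch` exactly once. Returns None when the download throws or
      // yields the "{}" placeholder, so the caller can drop the row right away.
      def downloadRow(
          fileName: String,
          url: String,
          fetch: String => String): Option[(String, String, String)] = {
        val data = try fetch(url) catch { case _: Exception => "{}" }
        if (data == "{}") None else Some((fileName, url, data))
      }
    }

    // Possible use with Spark (an assumption, not run here):
    //   df.rdd.flatMap(r =>
    //     SinglePassDownload.downloadRow(r.getString(0), r.getString(1), fetch))
    ```

    Because the value that is tested is the same value that is stored, a row 
can no longer pass the check with one result and be written with another.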



