[ 
https://issues.apache.org/jira/browse/SPARK-18667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15821506#comment-15821506
 ] 

Ben edited comment on SPARK-18667 at 1/13/17 9:53 AM:
------------------------------------------------------

So, I created a new example now, and here is the code for everything:

a.xml:
{noformat}
<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
{noformat}

b.xml:
{noformat}
<root>
  <file>file:/C:/a.xml</file>
  <other>AAA</other>
</root>
{noformat}

code:
{noformat}
from pyspark.sql.functions import udf,input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText',filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('../../res/Other/a.xml', 
rowTag='root').select('*',input_file_name().alias('file'))
df.select('file').show()
df.select(sameText(df['file'])).show()

df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root')
df3 = df.join(df2, 'file')

df.show()
df2.show()
df3.show()
df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show()
{noformat}

and this is the console output:
{noformat}
+--------------------+
|                file|
+--------------------+
|file:/C:/Users/SS...|
+--------------------+

+--------------+
|filename(file)|
+--------------+
|              |
+--------------+

+----+-----+--------------------+
|   x|    y|                file|
+----+-----+--------------------+
|TEXT|TEXT2|file:/C:/Users/SS...|
+----+-----+--------------------+

+--------------------+-----+
|                file|other|
+--------------------+-----+
|file:/C:/Users/SS...|  AAA|
+--------------------+-----+

+--------------------+----+-----+-----+
|                file|   x|    y|other|
+--------------------+----+-----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|  AAA|
+--------------------+----+-----+-----+


[Stage 26:>                                                         (0 + 4) / 4]
                                                                                

[Stage 29:>                                                        (0 + 8) / 20]
[Stage 29:=================>                                       (6 + 8) / 20]
[Stage 29:===================>                                     (7 + 8) / 20]
[Stage 29:======================>                                  (8 + 8) / 20]
[Stage 29:============================>                           (10 + 8) / 20]
[Stage 29:====================================>                   (13 + 7) / 20]
[Stage 29:=======================================>                (14 + 6) / 20]
[Stage 29:==========================================>             (15 + 5) / 20]
                                                                                

[Stage 32:>                                                       (0 + 8) / 100]
[Stage 32:===>                                                    (7 + 8) / 100]
[Stage 32:====>                                                   (8 + 8) / 100]
[Stage 32:=======>                                               (13 + 8) / 100]
[Stage 32:========>                                              (15 + 8) / 100]
[Stage 32:===========>                                           (20 + 8) / 100]
[Stage 32:============>                                          (22 + 8) / 100]
[Stage 32:==============>                                        (27 + 8) / 100]
[Stage 32:===============>                                       (29 + 8) / 100]
[Stage 32:==================>                                    (34 + 8) / 100]
[Stage 32:===================>                                   (36 + 8) / 100]
[Stage 32:======================>                                (41 + 8) / 100]
[Stage 32:=======================>                               (42 + 8) / 100]
[Stage 32:=========================>                             (46 + 8) / 100]
[Stage 32:==========================>                            (48 + 8) / 100]
[Stage 32:==========================>                            (49 + 8) / 100]
[Stage 32:===========================>                           (50 + 8) / 100]
[Stage 32:=============================>                         (53 + 8) / 100]
[Stage 32:==============================>                        (55 + 8) / 100]
[Stage 32:==============================>                        (56 + 8) / 100]
[Stage 32:===============================>                       (57 + 8) / 100]
[Stage 32:=================================>                     (60 + 8) / 100]
[Stage 32:==================================>                    (62 + 8) / 100]
[Stage 32:==================================>                    (63 + 8) / 100]
[Stage 32:===================================>                   (65 + 8) / 100]
[Stage 32:====================================>                  (67 + 8) / 100]
[Stage 32:=====================================>                 (69 + 8) / 100]
[Stage 32:======================================>                (70 + 8) / 100]
[Stage 32:=======================================>               (72 + 8) / 100]
[Stage 32:========================================>              (74 + 8) / 100]
[Stage 32:=========================================>             (76 + 8) / 100]
[Stage 32:==========================================>            (77 + 8) / 100]
[Stage 32:===========================================>           (79 + 8) / 100]
[Stage 32:============================================>          (81 + 8) / 100]
[Stage 32:=============================================>         (83 + 8) / 100]
[Stage 32:==============================================>        (84 + 8) / 100]
[Stage 32:===============================================>       (86 + 8) / 100]
[Stage 32:================================================>      (88 + 8) / 100]
[Stage 32:=================================================>     (90 + 8) / 100]
[Stage 32:==================================================>    (91 + 8) / 100]
[Stage 32:===================================================>   (93 + 7) / 100]
[Stage 32:====================================================>  (95 + 5) / 100]
[Stage 32:=====================================================> (97 + 3) / 100]
                                                                                

[Stage 35:>                                                        (0 + 8) / 75]
[Stage 35:==>                                                     (4 + 11) / 75]
[Stage 35:=====>                                                   (7 + 8) / 75]
[Stage 35:======>                                                  (8 + 8) / 75]
[Stage 35:=========>                                              (13 + 8) / 75]
[Stage 35:==========>                                             (14 + 8) / 75]
[Stage 35:===========>                                            (15 + 8) / 75]
[Stage 35:===========>                                            (16 + 8) / 75]
[Stage 35:==============>                                         (20 + 8) / 75]
[Stage 35:===============>                                        (21 + 8) / 75]
[Stage 35:================>                                       (22 + 8) / 75]
[Stage 35:=================>                                      (23 + 8) / 75]
[Stage 35:====================>                                   (27 + 8) / 75]
[Stage 35:====================>                                   (28 + 8) / 75]
[Stage 35:=====================>                                  (29 + 8) / 75]
[Stage 35:======================>                                 (30 + 8) / 75]
[Stage 35:=========================>                              (34 + 8) / 75]
[Stage 35:==========================>                             (35 + 8) / 75]
[Stage 35:==========================>                             (36 + 8) / 75]
[Stage 35:===========================>                            (37 + 8) / 75]
[Stage 35:==============================>                         (41 + 8) / 75]
[Stage 35:===============================>                        (42 + 8) / 75]
[Stage 35:================================>                       (43 + 8) / 75]
[Stage 35:================================>                       (44 + 8) / 75]
[Stage 35:===================================>                    (48 + 8) / 75]
[Stage 35:====================================>                   (49 + 8) / 75]
[Stage 35:=====================================>                  (50 + 8) / 75]
[Stage 35:======================================>                 (51 + 8) / 75]
[Stage 35:=========================================>              (55 + 8) / 75]
[Stage 35:=========================================>              (56 + 8) / 75]
[Stage 35:==========================================>             (57 + 8) / 75]
[Stage 35:===========================================>            (58 + 8) / 75]
[Stage 35:==============================================>         (62 + 8) / 75]
[Stage 35:===============================================>        (63 + 8) / 75]
[Stage 35:================================================>       (65 + 8) / 75]
[Stage 35:===================================================>    (69 + 6) / 75]
[Stage 35:=====================================================>  (72 + 3) / 75]
+--------------------+----+-----+
|                FILE|COL1| COL2|
+--------------------+----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|
+--------------------+----+-----+

SUCCESS: The process with PID 11916 (child process of PID 3592) has been 
terminated.
SUCCESS: The process with PID 3592 (child process of PID 9904) has been 
terminated.
SUCCESS: The process with PID 9904 (child process of PID 5468) has been 
terminated.
{noformat}

As you can see, After I add the "file" column, and show it, it's working. But 
then if I apply an UDF on it, it returns empty.
Then I proceed with the join, and show each dataframe again. Everything is OK 
until the last row, where it takes a very long time considering the amount of 
data and the speed of the previous processes. And this worries me becauses I 
don't think it is supposed to take such a long time.
Additionally, although in this example it works in the end, in my actual code 
it is not working. So as I already wrote on the previous post, if I join two 
dataframes, everything is OK, until I do a select where I apply a UDF, and then 
the whole query returns empty. The UDF works well if I don't join.
The other thing is, that if I do a count on the dataframe after the select just 
mentioned, it returns the correct number of rows, and not 0, but if I try to 
show or write the rows, the dataframe comes up empty. I'm not sure why this is 
happening so I would appreciate any help, and to at least know whether it's a 
bug or not.


was (Author: someonehere15):
So, I created a new example now, and here is the code for everything:

a.xml:
{noformat}
<root>
  <x>TEXT</x>
  <y>TEXT2</y>
</root>
{noformat}

b.xml:
{noformat}
<root>
  <file>file:/C:/a.xml</file>
  <other>AAA</other>
</root>
{noformat}

code:
{noformat}
from pyspark.sql.functions import udf,input_file_name
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

def filename(path):
    return path

session = SparkSession.builder.appName('APP').getOrCreate()

session.udf.register('sameText',filename)
sameText = udf(filename, StringType())

df = session.read.format('xml').load('../../res/Other/a.xml', 
rowTag='root').select('*',input_file_name().alias('file'))
df.select('file').show()
df.select(sameText(df['file'])).show()

df2 = session.read.format('xml').load('../../res/Other/b.xml', rowTag='root')
df3 = df.join(df2, 'file')

df.show()
df2.show()
df3.show()
df3.selectExpr('file as FILE','x AS COL1','sameText(y) AS COL2').show()
{noformat}

and this is the console output:
{noformat}
2017-01-13 10:27:55 WARN   org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
+--------------------+
|                file|
+--------------------+
|file:/C:/Users/SS...|
+--------------------+

+--------------+
|filename(file)|
+--------------+
|              |
+--------------+

+----+-----+--------------------+
|   x|    y|                file|
+----+-----+--------------------+
|TEXT|TEXT2|file:/C:/Users/SS...|
+----+-----+--------------------+

+--------------------+-----+
|                file|other|
+--------------------+-----+
|file:/C:/Users/SS...|  AAA|
+--------------------+-----+

+--------------------+----+-----+-----+
|                file|   x|    y|other|
+--------------------+----+-----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|  AAA|
+--------------------+----+-----+-----+


[Stage 26:>                                                         (0 + 4) / 4]
                                                                                

[Stage 29:>                                                        (0 + 8) / 20]
[Stage 29:=================>                                       (6 + 8) / 20]
[Stage 29:===================>                                     (7 + 8) / 20]
[Stage 29:======================>                                  (8 + 8) / 20]
[Stage 29:============================>                           (10 + 8) / 20]
[Stage 29:====================================>                   (13 + 7) / 20]
[Stage 29:=======================================>                (14 + 6) / 20]
[Stage 29:==========================================>             (15 + 5) / 20]
                                                                                

[Stage 32:>                                                       (0 + 8) / 100]
[Stage 32:===>                                                    (7 + 8) / 100]
[Stage 32:====>                                                   (8 + 8) / 100]
[Stage 32:=======>                                               (13 + 8) / 100]
[Stage 32:========>                                              (15 + 8) / 100]
[Stage 32:===========>                                           (20 + 8) / 100]
[Stage 32:============>                                          (22 + 8) / 100]
[Stage 32:==============>                                        (27 + 8) / 100]
[Stage 32:===============>                                       (29 + 8) / 100]
[Stage 32:==================>                                    (34 + 8) / 100]
[Stage 32:===================>                                   (36 + 8) / 100]
[Stage 32:======================>                                (41 + 8) / 100]
[Stage 32:=======================>                               (42 + 8) / 100]
[Stage 32:=========================>                             (46 + 8) / 100]
[Stage 32:==========================>                            (48 + 8) / 100]
[Stage 32:==========================>                            (49 + 8) / 100]
[Stage 32:===========================>                           (50 + 8) / 100]
[Stage 32:=============================>                         (53 + 8) / 100]
[Stage 32:==============================>                        (55 + 8) / 100]
[Stage 32:==============================>                        (56 + 8) / 100]
[Stage 32:===============================>                       (57 + 8) / 100]
[Stage 32:=================================>                     (60 + 8) / 100]
[Stage 32:==================================>                    (62 + 8) / 100]
[Stage 32:==================================>                    (63 + 8) / 100]
[Stage 32:===================================>                   (65 + 8) / 100]
[Stage 32:====================================>                  (67 + 8) / 100]
[Stage 32:=====================================>                 (69 + 8) / 100]
[Stage 32:======================================>                (70 + 8) / 100]
[Stage 32:=======================================>               (72 + 8) / 100]
[Stage 32:========================================>              (74 + 8) / 100]
[Stage 32:=========================================>             (76 + 8) / 100]
[Stage 32:==========================================>            (77 + 8) / 100]
[Stage 32:===========================================>           (79 + 8) / 100]
[Stage 32:============================================>          (81 + 8) / 100]
[Stage 32:=============================================>         (83 + 8) / 100]
[Stage 32:==============================================>        (84 + 8) / 100]
[Stage 32:===============================================>       (86 + 8) / 100]
[Stage 32:================================================>      (88 + 8) / 100]
[Stage 32:=================================================>     (90 + 8) / 100]
[Stage 32:==================================================>    (91 + 8) / 100]
[Stage 32:===================================================>   (93 + 7) / 100]
[Stage 32:====================================================>  (95 + 5) / 100]
[Stage 32:=====================================================> (97 + 3) / 100]
                                                                                

[Stage 35:>                                                        (0 + 8) / 75]
[Stage 35:==>                                                     (4 + 11) / 75]
[Stage 35:=====>                                                   (7 + 8) / 75]
[Stage 35:======>                                                  (8 + 8) / 75]
[Stage 35:=========>                                              (13 + 8) / 75]
[Stage 35:==========>                                             (14 + 8) / 75]
[Stage 35:===========>                                            (15 + 8) / 75]
[Stage 35:===========>                                            (16 + 8) / 75]
[Stage 35:==============>                                         (20 + 8) / 75]
[Stage 35:===============>                                        (21 + 8) / 75]
[Stage 35:================>                                       (22 + 8) / 75]
[Stage 35:=================>                                      (23 + 8) / 75]
[Stage 35:====================>                                   (27 + 8) / 75]
[Stage 35:====================>                                   (28 + 8) / 75]
[Stage 35:=====================>                                  (29 + 8) / 75]
[Stage 35:======================>                                 (30 + 8) / 75]
[Stage 35:=========================>                              (34 + 8) / 75]
[Stage 35:==========================>                             (35 + 8) / 75]
[Stage 35:==========================>                             (36 + 8) / 75]
[Stage 35:===========================>                            (37 + 8) / 75]
[Stage 35:==============================>                         (41 + 8) / 75]
[Stage 35:===============================>                        (42 + 8) / 75]
[Stage 35:================================>                       (43 + 8) / 75]
[Stage 35:================================>                       (44 + 8) / 75]
[Stage 35:===================================>                    (48 + 8) / 75]
[Stage 35:====================================>                   (49 + 8) / 75]
[Stage 35:=====================================>                  (50 + 8) / 75]
[Stage 35:======================================>                 (51 + 8) / 75]
[Stage 35:=========================================>              (55 + 8) / 75]
[Stage 35:=========================================>              (56 + 8) / 75]
[Stage 35:==========================================>             (57 + 8) / 75]
[Stage 35:===========================================>            (58 + 8) / 75]
[Stage 35:==============================================>         (62 + 8) / 75]
[Stage 35:===============================================>        (63 + 8) / 75]
[Stage 35:================================================>       (65 + 8) / 75]
[Stage 35:===================================================>    (69 + 6) / 75]
[Stage 35:=====================================================>  (72 + 3) / 75]
+--------------------+----+-----+
|                FILE|COL1| COL2|
+--------------------+----+-----+
|file:/C:/Users/SS...|TEXT|TEXT2|
+--------------------+----+-----+

SUCCESS: The process with PID 11916 (child process of PID 3592) has been 
terminated.
SUCCESS: The process with PID 3592 (child process of PID 9904) has been 
terminated.
SUCCESS: The process with PID 9904 (child process of PID 5468) has been 
terminated.
{noformat}

As you can see, After I add the "file" column, and show it, it's working. But 
then if I apply an UDF on it, it returns empty.
Then I proceed with the join, and show each dataframe again. Everything is OK 
until the last row, where it takes a very long time considering the amount of 
data and the speed of the previous processes. And this worries me becauses I 
don't think it is supposed to take such a long time.
Additionally, although in this example it works in the end, in my actual code 
it is not working. So as I already wrote on the previous post, if I join two 
dataframes, everything is OK, until I do a select where I apply a UDF, and then 
the whole query returns empty. The UDF works well if I don't join.
The other thing is, that if I do a count on the dataframe after the select just 
mentioned, it returns the correct number of rows, and not 0, but if I try to 
show or write the rows, the dataframe comes up empty. I'm not sure why this is 
happening so I would appreciate any help, and to at least know whether it's a 
bug or not.

> input_file_name function does not work with UDF
> -----------------------------------------------
>
>                 Key: SPARK-18667
>                 URL: https://issues.apache.org/jira/browse/SPARK-18667
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>            Reporter: Hyukjin Kwon
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>
> {{input_file_name()}} does not return the file name but empty string instead 
> when it is used as input for UDF in PySpark as below: 
> with the data as below:
> {code}
> {"a": 1}
> {code}
> with the codes below:
> {code}
> from pyspark.sql.functions import *
> from pyspark.sql.types import *
> def filename(path):
>     return path
> sourceFile = udf(filename, StringType())
> spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()
> {code}
> prints as below:
> {code}
> +---------------------------+
> |filename(input_file_name())|
> +---------------------------+
> |                           |
> +---------------------------+
> {code}
> but the codes below:
> {code}
> spark.read.json("tmp.json").select(input_file_name()).show()
> {code}
> prints correctly as below:
> {code}
> +--------------------+
> |   input_file_name()|
> +--------------------+
> |file:///Users/hyu...|
> +--------------------+
> {code}
> This seems PySpark specific issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to