[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430278#comment-15430278
 ] 

Sean Owen commented on SPARK-17143:
---

This sounds like an HDFS environment problem, then. This directory should exist 
and be writable by all users from the time the HDFS filesystem is created.
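
A quick way to confirm that expectation from a pyspark session is to ask the
Hadoop FileSystem API directly. This is a minimal diagnostic sketch; it goes
through pyspark's internal sc._jvm / sc._jsc gateway handles, which are not
public API, and `hadoop fs -ls /` from a shell answers the same question:

    # Inspect /tmp on the cluster's default filesystem (HDFS here).
    hadoop_conf = sc._jsc.hadoopConfiguration()
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
    status = fs.getFileStatus(sc._jvm.org.apache.hadoop.fs.Path("/tmp"))

    print("is directory: {}".format(status.isDirectory()))    # should be True
    print("permissions:  {}".format(status.getPermission()))  # should be world-writable

In the cluster reported below, /tmp on HDFS turned out to be a plain file, so
the isDirectory() check catches the problem immediately.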

> pyspark unable to create UDF: java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> ---
>
> Key: SPARK-17143
> URL: https://issues.apache.org/jira/browse/SPARK-17143
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
>Reporter: Andrew Davidson
> Attachments: udfBug.html, udfBug.ipynb
>
>
> For an unknown reason I cannot create a UDF when I run the attached notebook on 
> my cluster. I get the following error:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> The notebook runs fine on my Mac.
> In general I am able to run non-UDF Spark code without any trouble.
> I start the notebook server as the user "ec2-user" and use the master URL 
>   spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066
> I found the following messages in the notebook server log file. I have the log 
> level set to warn:
> 16/08/18 21:38:45 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> The cluster was originally created using 
> spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> #from pyspark.sql import SQLContext, HiveContext
> #sqlContext = SQLContext(sc)
>
> #from pyspark.sql import DataFrame
> #from pyspark.sql import functions
>
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
>
> print("spark version: {}".format(sc.version))
>
> import sys
> print("python version: {}".format(sys.version))
> spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
> # functions.lower() raises 
> # py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> # work around: define a UDF
> toLowerUDFRetType = StringType()
> #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> toLowerUDF = udf(lambda s : s.lower(), StringType())
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
> assembly
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
>   4 toLowerUDFRetType = StringType()
>   5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> > 6 toLowerUDF = udf(lambda s : s.lower(), StringType())
> /root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
>1595 [Row(slen=5), Row(slen=3)]
>1596 """
> -> 1597 return UserDefinedFunction(f, returnType)
>1598
>1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
> /root/spark/python/pyspark/sql/functions.py in __init__(self, func, 
> returnType, name)
>1556 self.returnType = returnType
>1557 self._broadcast = None
> -> 1558 self._judf = self._create_judf(name)
>1559
>1560 def _create_judf(self, name):
> /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
>1567 pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command, self)
>1568 ctx = SQLContext.getOrCreate(sc)
> -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
>1570 if name is None:
>1571 name = f.__name__ if hasattr(f, '__name__') else 
> f.__class__.__name__
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 681 try:
> 682 if not hasattr(self, '_scala_HiveContext'):
> --> 683 self._scala_HiveContext = self._get_hive_ctx()
> 684 return self._scala_HiveContext
> 685 except Py4JError as e:
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 690
> 691 def _get_hive_ctx(self):
> --> 692 return self._jvm.HiveContext(self._jsc.sc())
> 693
> 694 def refreshTable(self, tableName):
> /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1062 answer = 

[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427394#comment-15427394
 ] 

Andrew Davidson commented on SPARK-17143:
-

See the email from the user's group below. I was able to find a workaround. I am 
not sure how hdfs:///tmp got created or how the permissions got messed up.

##

NICE CATCH!!! Many thanks. 

I spent all day on this bug.

The error msg reports /tmp. I did not think to look on HDFS.

[ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/
Found 1 items
-rw-r--r--   3 ec2-user supergroup        418 2016-04-13 22:49 hdfs:///tmp
[ec2-user@ip-172-31-22-140 notebooks]$ 


I have no idea how hdfs:///tmp got created. I deleted it. 

Deleting it causes a bunch of exceptions, but these exceptions have useful 
messages. I was able to fix the problem as follows:

$ hadoop fs -rmr hdfs:///tmp

Now when I run the notebook, it creates hdfs:///tmp/hive, but the permissions are wrong:

$ hadoop fs -chmod 777 hdfs:///tmp/hive
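
To make that recovery repeatable, the same sequence can be scripted. Below is a
minimal sketch that shells out to the standard hadoop fs subcommands (-test,
-rm, -mkdir, -chmod); the paths and the blanket 777 mode simply mirror the
manual fix above and are assumptions, not a hardened policy:

    import subprocess

    def hdfs(*args):
        # Thin wrapper around the hadoop CLI; returns the process exit code.
        return subprocess.call(["hadoop", "fs"] + list(args))

    # If /tmp exists on HDFS but is not a directory (the bug here), remove it.
    if hdfs("-test", "-e", "/tmp") == 0 and hdfs("-test", "-d", "/tmp") != 0:
        hdfs("-rm", "/tmp")  # in this incident /tmp was a plain 418-byte file

    # Recreate the scratch directory Hive expects, world-writable as above.
    hdfs("-mkdir", "-p", "/tmp/hive")
    hdfs("-chmod", "-R", "777", "/tmp")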


From: Felix Cheung 
Date: Thursday, August 18, 2016 at 3:37 PM
To: Andrew Davidson , "user @spark" 

Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: 
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
directory: /tmp tmp

Do you have a file called tmp at / on HDFS?

[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427278#comment-15427278
 ] 

Andrew Davidson commented on SPARK-17143:
-

Given that the exception mentioned an issue with /tmp, I decided to track how 
/tmp changed as I ran my cells:

# no spark jobs are running
[ec2-user@ip-172-31-22-140 notebooks]$ !ls
ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$ 

# start notebook server
$ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out &

[ec2-user@ip-172-31-22-140 notebooks]$ !ls
ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$ 

# start the udfBug notebook
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  
libnetty-transport-native-epoll818283657820702.so  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$ 

# execute the cell that defines the UDF
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  
libnetty-transport-native-epoll818283657820702.so  pip_build_ec2-user  
spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9
[ec2-user@ip-172-31-22-140 notebooks]$ 

[ec2-user@ip-172-31-22-140 notebooks]$ find 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c180.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c2b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c311.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c880.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c541.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c9f1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c20.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c590.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c721.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c470.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c441.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c8e1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c361.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c421.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c331.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c461.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c5d0.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c851.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c621.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c101.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3d1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c891.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c641.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c871.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c6a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/cb1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca01.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c391.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7f1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c41.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c990.dat
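
For context on the listing above: the spark-*/metastore tree is the embedded
Derby metastore that a HiveContext creates under the driver's local /tmp the
first time it is used, and the traceback earlier in this issue shows the
failure happening inside HiveContext construction itself. A minimal snippet
that triggers exactly this initialization in Spark 1.6 (assuming an active
SparkContext sc, as in the notebook):

    from pyspark.sql import HiveContext

    # Constructing a HiveContext initializes the Hive client. In Spark 1.6 this
    # creates the local Derby metastore seen in the listing above, and it also
    # needs hdfs:///tmp/hive to be a writable directory on the default
    # filesystem, which is why a plain file at hdfs:///tmp breaks UDF creation.
    sqlContext = HiveContext(sc)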