[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430278#comment-15430278 ]

Sean Owen commented on SPARK-17143:
-----------------------------------

This sounds like an HDFS environment problem then. This directory would exist and be writable by all users when HDFS's file system is created.

> pyspark unable to create UDF: java.lang.RuntimeException:
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a
> directory: /tmp tmp
> ---------------------------------------------------------------------
>
>                 Key: SPARK-17143
>                 URL: https://issues.apache.org/jira/browse/SPARK-17143
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.1
>         Environment: spark version: 1.6.1
>                      python version: 3.4.3 (default, Apr 1 2015, 18:10:40)
>                      [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
>            Reporter: Andrew Davidson
>         Attachments: udfBug.html, udfBug.ipynb
>
> For unknown reasons I cannot create a UDF when I run the attached notebook on my cluster. I get the following error:
>
> Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp
>
> The notebook runs fine on my Mac. In general I am able to run non-UDF Spark code without any trouble.
> I start the notebook server as the user "ec2-user" and use the master URL spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066
> I found the following messages in the notebook server log file (log level is set to warn):
>
> 16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
>
> The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
>
> #from pyspark.sql import SQLContext, HiveContext
> #sqlContext = SQLContext(sc)
>
> #from pyspark.sql import DataFrame
> #from pyspark.sql import functions
>
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
>
> print("spark version: {}".format(sc.version))
>
> import sys
> print("python version: {}".format(sys.version))
>
> spark version: 1.6.1
> python version: 3.4.3 (default, Apr 1 2015, 18:10:40)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
>
> # functions.lower() raises
> # py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> # work around: define a UDF
> toLowerUDFRetType = StringType()
> #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> toLowerUDF = udf(lambda s : s.lower(), StringType())
>
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
>
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input> in <module>()
>       4 toLowerUDFRetType = StringType()
>       5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> ----> 6 toLowerUDF = udf(lambda s : s.lower(), StringType())
>
> /root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
>    1595     [Row(slen=5), Row(slen=3)]
>    1596     """
> -> 1597     return UserDefinedFunction(f, returnType)
>    1598
>    1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
>
> /root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name)
>    1556         self.returnType = returnType
>    1557         self._broadcast = None
> -> 1558         self._judf = self._create_judf(name)
>    1559
>    1560     def _create_judf(self, name):
>
> /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
>    1567         pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self)
>    1568         ctx = SQLContext.getOrCreate(sc)
> -> 1569         jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
>    1570         if name is None:
>    1571             name = f.__name__ if hasattr(f, '__name__') else f.__class__.__name__
>
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
>     681         try:
>     682             if not hasattr(self, '_scala_HiveContext'):
> --> 683                 self._scala_HiveContext = self._get_hive_ctx()
>     684             return self._scala_HiveContext
>     685         except Py4JError as e:
>
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
>     690
>     691     def _get_hive_ctx(self):
> --> 692         return self._jvm.HiveContext(self._jsc.sc())
>     693
>     694     def refreshTable(self, tableName):
>
> /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
>    1062         answer =
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427394#comment-15427394 ]

Andrew Davidson commented on SPARK-17143:
-----------------------------------------

See the email from the user's group below. I was able to find a workaround. I am not sure how hdfs:///tmp got created as a file or how the permissions got messed up.

## NICE CATCH!!! Many thanks. I spent all day on this bug.

The error message reports /tmp; I did not think to look on HDFS.

[ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/
Found 1 items
-rw-r--r--   3 ec2-user supergroup        418 2016-04-13 22:49 hdfs:///tmp
[ec2-user@ip-172-31-22-140 notebooks]$

I have no idea how hdfs:///tmp got created as a file. I deleted it. That caused a bunch of exceptions, but those exceptions had useful messages, and I was able to fix the problem as follows:

$ hadoop fs -rmr hdfs:///tmp

Now when I run the notebook it creates hdfs:///tmp/hive, but the permissions are wrong:

$ hadoop fs -chmod 777 hdfs:///tmp/hive

From: Felix Cheung
Date: Thursday, August 18, 2016 at 3:37 PM
To: Andrew Davidson, "user @spark"
Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

Do you have a file called tmp at / on HDFS?
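The root cause found above (a plain file occupying the path where Hive expects a directory) can be reproduced without HDFS at all. The sketch below is a local-filesystem analogue, not anything from the thread: it creates a regular file named `tmp` inside a scratch directory and then shows that creating a child path such as `tmp/hive` fails, which is the POSIX counterpart of Hadoop's "Parent path is not a directory: /tmp tmp".

```python
import os
import tempfile

# Local-filesystem analogue of the HDFS problem reported above:
# a plain file sits at the path where a directory is expected.
base = tempfile.mkdtemp()
bad_tmp = os.path.join(base, "tmp")

with open(bad_tmp, "w") as f:
    f.write("not a directory")  # hdfs:///tmp was a 418-byte file, not a directory

try:
    # Hive tries to create a scratch dir under /tmp; this is the same shape of request.
    os.makedirs(os.path.join(bad_tmp, "hive"))
    parent_is_not_a_directory = False
except NotADirectoryError:
    # Local counterpart of org.apache.hadoop.fs.FileAlreadyExistsException:
    # "Parent path is not a directory"
    parent_is_not_a_directory = True

print(parent_is_not_a_directory)  # → True
```

This is also why the error message was so confusing: the local /tmp on the driver node was a perfectly normal directory, and only `hadoop fs -ls` revealed the offending file.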
[ https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427278#comment-15427278 ]

Andrew Davidson commented on SPARK-17143:
-----------------------------------------

Given that the exception mentioned an issue with /tmp, I decided to track how /tmp changed when I ran my cell.

# no spark jobs are running
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$

# start the notebook server
$ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out &
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$

# start the udfBug notebook
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  libnetty-transport-native-epoll818283657820702.so  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$

# execute the cell that defines the UDF
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  libnetty-transport-native-epoll818283657820702.so  pip_build_ec2-user  spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9
[ec2-user@ip-172-31-22-140 notebooks]$

[ec2-user@ip-172-31-22-140 notebooks]$ find /tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c180.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c2b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c311.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c880.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c541.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c9f1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c20.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c590.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c721.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c470.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c441.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c8e1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c361.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c421.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c331.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c461.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c5d0.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c851.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c621.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c101.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3d1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c891.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c641.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c871.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c6a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/cb1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca01.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c391.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7f1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c41.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c990.dat
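The remediation described earlier in the thread boils down to a short sequence of `hadoop fs` invocations. The sketch below is a hypothetical helper (not from the thread) that collects those commands and only executes them when the `hadoop` CLI is actually on PATH, printing them otherwise. Two caveats: the `-mkdir -p` step is my addition for completeness (in the thread, rerunning the notebook recreated hdfs:///tmp/hive before the chmod), and `-rm -r` is the non-deprecated spelling of the `-rmr` used above.

```python
import shutil
import subprocess

# The fix from this thread, as explicit `hadoop fs` invocations.
# NOTE: the -mkdir step is an assumption; in the thread, rerunning the
# notebook recreated hdfs:///tmp/hive before the chmod was applied.
FIX_STEPS = [
    ["hadoop", "fs", "-rm", "-r", "hdfs:///tmp"],         # remove the bogus file (old syntax: -rmr)
    ["hadoop", "fs", "-mkdir", "-p", "hdfs:///tmp/hive"], # recreate the Hive scratch dir
    ["hadoop", "fs", "-chmod", "777", "hdfs:///tmp/hive"],# make it world-writable
]

def apply_fix(dry_run=None):
    """Run the fix, or just print the commands when hadoop is unavailable."""
    if dry_run is None:
        dry_run = shutil.which("hadoop") is None
    for cmd in FIX_STEPS:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)

apply_fix(dry_run=True)  # print the three commands without executing them
```

Making /tmp/hive mode 777 matches what a fresh HDFS install provides, which is why, as noted above, this directory "would exist and be writable by all users when HDFS's file system is created".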