venu k tangirala created SPARK-5866:
---------------------------------------
Summary: pyspark read from s3
Key: SPARK-5866
URL: https://issues.apache.org/jira/browse/SPARK-5866
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.1
Environment: mac OSx and ec2 ubuntu
Reporter: venu k tangirala
When reading data from S3 with PySpark, I get the following error:
  File "/Users/myname/leeo/path/./spark_json.py", line 55, in <module>
    vals_table = sqlContext.inferSchema(values)
  File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
    first = rdd.first()
  File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
    rs = self.take(1)
  File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
    totalParts = self._jrdd.partitions().size()
  File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/1111_1417479684
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
    at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
    at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:724)
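The original script is not attached, so for context here is a sketch of the kind of code that produces this call sequence. It is reconstructed from the traceback only: WholeTextFileInputFormat in the Java frames points at sc.wholeTextFiles(), the S3 path is taken from the error message, and the JSON-parsing map step is an assumption (the script name spark_json.py suggests it). Running it requires a live Spark 1.2.1 installation and S3 credentials.

```python
# Hedged reconstruction of the failing sequence -- NOT the reporter's
# actual script. The json.loads map step is an assumption.
import json
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark_json")
sqlContext = SQLContext(sc)

# WholeTextFileInputFormat in the stack trace implies wholeTextFiles();
# the s3:// path is the one from the error message.
raw = sc.wholeTextFiles("s3://bucketName/pathS3/1111_1417479684")
values = raw.map(lambda kv: json.loads(kv[1]))

# inferSchema() is the first action: it calls rdd.first() -> take(1)
# -> partitions(), which is where Hadoop's FileInputFormat.listStatus()
# throws InvalidInputException for the unresolved s3:// path.
vals_table = sqlContext.inferSchema(values)
```

Because RDDs are lazy, nothing touches S3 until inferSchema() forces the first action, which is why the path error surfaces at o24.partitions rather than at the wholeTextFiles() call.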
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)