venu k tangirala created SPARK-5866:
---------------------------------------
Summary: pyspark read from s3
Key: SPARK-5866
URL: https://issues.apache.org/jira/browse/SPARK-5866
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.1
Environment: mac OSx and ec2 ubuntu
Reporter: venu k tangirala
When reading data from S3 with PySpark, I get the following error:
  File "/Users/myname/leeo/path/./spark_json.py", line 55, in <module>
    vals_table = sqlContext.inferSchema(values)
  File "/Users/myname/spark-1.2.1/python/pyspark/sql.py", line 1332, in inferSchema
    first = rdd.first()
  File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
    rs = self.take(1)
  File "/Users/myname/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
    totalParts = self._jrdd.partitions().size()
  File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/java_gateway.py", line 538, in __call__
    self.target_id, self.name)
  File "/anaconda/lib/python2.7/site-packages/py4j-0.8.2.1-py2.7.egg/py4j/protocol.py", line 300, in get_return_value
    format(target_id, '.', name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o24.partitions.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://bucketName/pathS3/1111_1417479684
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
    at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:61)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:269)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:53)
    at org.apache.spark.api.java.JavaRDD.partitions(JavaRDD.scala:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:724)
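The original script is not attached, so for context here is a sketch of the kind of code that produces this call sequence. It is reconstructed from the traceback only: WholeTextFileInputFormat in the Java frames points at sc.wholeTextFiles(), the S3 path is taken from the error message, and the JSON-parsing map step is an assumption (the script name spark_json.py suggests it). Running it requires a live Spark 1.2.1 installation and S3 credentials.

```python
# Hedged reconstruction of the failing sequence -- NOT the reporter's
# actual script. The json.loads map step is an assumption.
import json
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark_json")
sqlContext = SQLContext(sc)

# WholeTextFileInputFormat in the stack trace implies wholeTextFiles();
# the s3:// path is the one from the error message.
raw = sc.wholeTextFiles("s3://bucketName/pathS3/1111_1417479684")
values = raw.map(lambda kv: json.loads(kv[1]))

# inferSchema() is the first action: it calls rdd.first() -> take(1)
# -> partitions(), which is where Hadoop's FileInputFormat.listStatus()
# throws InvalidInputException for the unresolved s3:// path.
vals_table = sqlContext.inferSchema(values)
```

Because RDDs are lazy, nothing touches S3 until inferSchema() forces the first action, which is why the path error surfaces at o24.partitions rather than at the wholeTextFiles() call.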
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)