Hey there Ignacio,

Like Reynold said, it's probably related to your build of Spark; try building without Thrift (i.e., drop -Phive-thriftserver).

Also, try this command to see what the actual error is, and link it here:

sc.wholeTextFiles("s3://my-directory/2015*/ignacio/*")
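Note that wholeTextFiles, like textFile, is lazy, so nothing is actually read until you run an action on it. A minimal sketch to force the read (the path is just a placeholder for yours):

# Transformations are lazy; an action like take() forces the S3 read,
# so any connection/credentials problem shows up in the stack trace here.
rdd = sc.wholeTextFiles("s3://my-directory/2015*/ignacio/*")
print(rdd.take(1))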
(PS: Are you using boto to connect? Which version?)

Igor

On Tue, Jun 2, 2015 at 7:26 PM, Reynold Xin <r...@databricks.com> wrote:

> Maybe an incompatible Hive package or Hive metastore?
>
> On Tue, Jun 2, 2015 at 3:25 PM, Ignacio Zendejas <i...@node.io> wrote:
>
>> From RELEASE:
>>
>> "Spark 1.3.1 built for Hadoop 2.4.0
>>
>> Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
>> -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive
>> -Phive-thriftserver"
>>
>> And this stack trace may be more useful:
>> http://pastebin.ca/3016483
>>
>> On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas <i...@node.io> wrote:
>>
>>> I've run into an error when trying to create a DataFrame. Here's the
>>> code:
>>>
>>> --
>>> from pyspark import StorageLevel
>>> from pyspark.sql import HiveContext, Row
>>>
>>> table = 'blah'
>>> ssc = HiveContext(sc)
>>>
>>> data = sc.textFile('s3://bucket/some.tsv')
>>>
>>> def deserialize(s):
>>>     p = s.strip().split('\t')
>>>     p[-1] = float(p[-1])
>>>     return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
>>>                created_at=p[3], layer_id=p[4], score=p[5])
>>>
>>> blah = data.map(deserialize)
>>> df = sqlContext.inferSchema(blah)
>>> ---
>>>
>>> I've also tried s3n and using createDataFrame. Our setup is on EMR
>>> instances, using the setup script Amazon provides. After lots of
>>> debugging, I suspect there may be a problem with this setup.
>>>
>>> What's weird is that if I run this in the pyspark shell and re-run the
>>> last line (inferSchema/createDataFrame), it actually works.
>>>
>>> We're getting warnings like this:
>>> http://pastebin.ca/3016476
>>>
>>> Here's the actual error:
>>> http://www.pastebin.ca/3016473
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thanks,
>>> Ignacio
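One more idea, Ignacio: inferSchema has to sample your RDD to figure out the column types, so you can take inference out of the picture entirely by passing an explicit schema to createDataFrame. A rough sketch, assuming the same six tab-separated columns as your deserialize function (the names and types are my guesses from your code, not something I've run against your data):

from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Explicit schema mirroring the Row fields above: everything a string
# except score, which deserialize converts to float.
schema = StructType([
    StructField("normalized_page_sha1", StringType(), True),
    StructField("name", StringType(), True),
    StructField("phrase", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("layer_id", StringType(), True),
    StructField("score", FloatType(), True),
])

def to_tuple(s):
    # Same parsing as deserialize, but returning a plain tuple in schema order.
    p = s.strip().split('\t')
    return (p[0], p[1], p[2], p[3], p[4], float(p[5]))

df = sqlContext.createDataFrame(data.map(to_tuple), schema)

If that works where inferSchema fails, at least you've narrowed it down to the sampling/inference step rather than the S3 read itself.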