Hey there Ignacio,

Like Reynold said, it's probably related to your build of Spark; try building without Thrift (i.e., drop -Phive-thriftserver).

Also, try this command to see what the actual error is, and link it here:

sc.wholeTextFiles("s3://my-directory/2015*/ignacio/*")
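Note that wholeTextFiles, like textFile, is lazy, so nothing is actually read until you run an action on it. A minimal sketch to force the read (the path is just a placeholder for yours):

# Transformations are lazy; an action like take() forces the S3 read,
# so any connection/credentials problem shows up in the stack trace here.
rdd = sc.wholeTextFiles("s3://my-directory/2015*/ignacio/*")
print(rdd.take(1))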
(PS: Are you using boto to connect? Which version?)

Igor

On Tue, Jun 2, 2015 at 7:26 PM, Reynold Xin <r...@databricks.com> wrote:

> Maybe an incompatible Hive package or Hive metastore?
>
> On Tue, Jun 2, 2015 at 3:25 PM, Ignacio Zendejas <i...@node.io> wrote:
>
>> From RELEASE:
>>
>> "Spark 1.3.1 built for Hadoop 2.4.0
>>
>> Build flags: -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests
>> -Pkinesis-asl -Pspark-ganglia-lgpl -Phadoop-provided -Phive
>> -Phive-thriftserver"
>>
>> And this stack trace may be more useful:
>> http://pastebin.ca/3016483
>>
>> On Tue, Jun 2, 2015 at 3:13 PM, Ignacio Zendejas <i...@node.io> wrote:
>>
>>> I've run into an error when trying to create a DataFrame. Here's the
>>> code:
>>>
>>> --
>>> from pyspark import StorageLevel
>>> from pyspark.sql import HiveContext, Row
>>>
>>> table = 'blah'
>>> ssc = HiveContext(sc)
>>>
>>> data = sc.textFile('s3://bucket/some.tsv')
>>>
>>> def deserialize(s):
>>>     p = s.strip().split('\t')
>>>     p[-1] = float(p[-1])
>>>     return Row(normalized_page_sha1=p[0], name=p[1], phrase=p[2],
>>>                created_at=p[3], layer_id=p[4], score=p[5])
>>>
>>> blah = data.map(deserialize)
>>> df = sqlContext.inferSchema(blah)
>>> ---
>>>
>>> I've also tried s3n and using createDataFrame. Our setup is on EMR
>>> instances, using the setup script Amazon provides. After lots of
>>> debugging, I suspect there may be a problem with this setup.
>>>
>>> What's weird is that if I run this in the pyspark shell and re-run the
>>> last line (inferSchema/createDataFrame), it actually works.
>>>
>>> We're getting warnings like this:
>>> http://pastebin.ca/3016476
>>>
>>> Here's the actual error:
>>> http://www.pastebin.ca/3016473
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thanks,
>>> Ignacio
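One more idea, Ignacio: inferSchema has to sample your RDD to figure out the column types, so you can take inference out of the picture entirely by passing an explicit schema to createDataFrame. A rough sketch, assuming the same six tab-separated columns as your deserialize function (the names and types are my guesses from your code, not something I've run against your data):

from pyspark.sql.types import StructType, StructField, StringType, FloatType

# Explicit schema mirroring the Row fields above: everything a string
# except score, which deserialize converts to float.
schema = StructType([
    StructField("normalized_page_sha1", StringType(), True),
    StructField("name", StringType(), True),
    StructField("phrase", StringType(), True),
    StructField("created_at", StringType(), True),
    StructField("layer_id", StringType(), True),
    StructField("score", FloatType(), True),
])

def to_tuple(s):
    # Same parsing as deserialize, but returning a plain tuple in schema order.
    p = s.strip().split('\t')
    return (p[0], p[1], p[2], p[3], p[4], float(p[5]))

df = sqlContext.createDataFrame(data.map(to_tuple), schema)

If that works where inferSchema fails, at least you've narrowed it down to the sampling/inference step rather than the S3 read itself.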