> I am using both 1.4.1 and 1.5.1.

That's the Spark version. I'm wondering what version of Hadoop your Spark
is built against.

For example, when you download Spark
<http://spark.apache.org/downloads.html> you have to select from a number
of packages (under "Choose a package type"), and each is built against a
different version of Hadoop. When Spark is built against Hadoop 2.6+, from
my understanding, you need to install additional libraries
<https://issues.apache.org/jira/browse/SPARK-7481> to access S3. When Spark
is built against Hadoop 2.4 or earlier, you don't need to do this.
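
If it turns out you are on a Hadoop 2.6+ build, one way to pull those extra
libraries in (rather than installing them by hand) is via --packages at
launch time; the version below is only an example and should match the
Hadoop version your Spark was built against:

    spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0

That resolves hadoop-aws along with its AWS SDK dependency and puts the
s3n/s3a filesystem classes on the classpath.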

I'd like to confirm that this is what's happening in your case.
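
One quick way to check is to look at the name of the assembly jar under
lib/ in your Spark installation; it encodes the Hadoop version the build
targets (the file name below is only an example):

    ls $SPARK_HOME/lib/
    # e.g. spark-assembly-1.5.1-hadoop2.6.0.jar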

Nick

On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:

> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of the
> new instance-profile feature, which greatly helps with this as well.
> Without an instance profile, we got it working by copying a
> .aws/credentials file up to each node. We could easily automate that
> through the templates.
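>
> For anyone taking the credentials-file route, the file is just the standard
> AWS credentials format (the values below are placeholders):
>
>   [default]
>   aws_access_key_id = YOUR_ACCESS_KEY_ID
>   aws_secret_access_key = YOUR_SECRET_ACCESS_KEY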
>
> We don't need any additional libraries; we just need to change the
> core-site.xml.
>
> -Christian
>
> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for sharing this, Christian.
>>
>> What build of Spark are you using? If I understand correctly, when you are
>> using Spark built against Hadoop 2.6+, additional configs alone won't
>> help, because additional libraries also need to be installed
>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>
>> Nick
>>
>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>
>>> We ended up reading from and writing to S3 a lot in our Spark jobs.
>>> For this to work, we had to add key/secret pairs for s3, s3n, and s3a.
>>> We also had to set fs.hdfs.impl to get these things to work.
>>>
>>> I thought I'd share what we did, since it might be worth adding these
>>> defaults to the Spark conf for out-of-the-box S3 functionality.
>>>
>>> We created:
>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>
>>> We changed the contents from the original, adding the following:
>>>
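>>>   <!-- The {{aws_access_key_id}} and {{aws_secret_access_key}} tokens are
>>>        spark-ec2 template placeholders, substituted at deploy time. -->
>>>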
>>>   <property>
>>>     <name>fs.file.impl</name>
>>>     <value>org.apache.hadoop.fs.LocalFileSystem</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.hdfs.impl</name>
>>>     <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.s3.impl</name>
>>>     <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.s3.awsAccessKeyId</name>
>>>     <value>{{aws_access_key_id}}</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.s3.awsSecretAccessKey</name>
>>>     <value>{{aws_secret_access_key}}</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.s3n.awsAccessKeyId</name>
>>>     <value>{{aws_access_key_id}}</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.s3n.awsSecretAccessKey</name>
>>>     <value>{{aws_secret_access_key}}</value>
>>>   </property>
>>>
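>>>   <!-- Depending on the Hadoop version, S3A may read fs.s3a.access.key and
>>>        fs.s3a.secret.key rather than the awsAccessKeyId /
>>>        awsSecretAccessKey style names used below. -->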
>>>   <property>
>>>     <name>fs.s3a.awsAccessKeyId</name>
>>>     <value>{{aws_access_key_id}}</value>
>>>   </property>
>>>
>>>   <property>
>>>     <name>fs.s3a.awsSecretAccessKey</name>
>>>     <value>{{aws_secret_access_key}}</value>
>>>   </property>
>>>
>>> This change makes Spark on EC2 work out of the box for us. It took us
>>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>>> version 2.
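>>>
>>> As a quick sanity check from spark-shell once this config is in place
>>> (the bucket and paths below are placeholders):
>>>
>>>   val lines = sc.textFile("s3n://your-bucket/some/input")
>>>   lines.count()
>>>   lines.saveAsTextFile("s3n://your-bucket/some/output")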
>>>
>>> Best Regards,
>>> Christian
>>>
>>
>
