Hi Benjamin,

First of all, fetching data from S3 while running your code on an
on-premise system is a very bad idea. You might want to copy the data into
local HDFS first, before running your code. Of course, this depends on the
volume of data and the internet bandwidth that you have.
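
For example, a rough sketch of that approach (the bucket, paths and file
names below are just placeholders, not yours):

    // 1. Copy the file down from S3 into HDFS first, e.g. with:
    //      hadoop distcp s3a://your-bucket/path/file.parquet hdfs:///data/file.parquet
    // 2. Then read it from HDFS in the Spark 1.6 shell:
    val df = sqlContext.read.parquet("hdfs:///data/file.parquet")
    df.count()  // now runs against local HDFS, with no round trips to S3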

The platform which makes your processing at least 10 times faster is Spark
2.1. And trust me, you do not want to be writing code which you will need
to rewrite in 6 months because newer versions of Spark have deprecated it.
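
Just to illustrate what I mean (a minimal sketch, not your actual code):
the SQLContext entry point you use in 1.6 is already superseded by
SparkSession in 2.x, so 1.6-style code like the first line below ends up
being rewritten as the rest:

    // Spark 1.6 style: read through the old SQLContext entry point
    val dfOld = sqlContext.read.parquet("hdfs:///data/file.parquet")

    // Spark 2.x style: SparkSession replaces SQLContext/HiveContext
    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().appName("parquet-read").getOrCreate()
    val dfNew = spark.read.parquet("hdfs:///data/file.parquet")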


Regards,
Gourav Sengupta



On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Gourav,
>
> My answers are below.
>
> Cheers,
> Ben
>
>
> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Can I ask where you are running your CDH? Is it on premise, or have you
> created a cluster for yourself in AWS? Our cluster is on premise, in our
> data center.
>
> Also, I have really never seen s3a used before; that was used way back
> when writing S3 files took a long time. But I think that you are only
> reading from it.
>
> Any ideas why you are not migrating to Spark 2.1? Besides speed, there
> are lots of new APIs, and the existing ones are being deprecated.
> Therefore, there is a very high chance that you are already working on
> code which is being deprecated by the Spark community right now. We use
> CDH and upgrade with whatever Spark version they include, which is
> currently 1.6.0. We are waiting for the move to Spark 2.0/2.1.
>
> And besides that, would you not want to work on a platform which is at
> least 10 times faster? What would that be?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB
>> Parquet file from AWS S3. We can read the schema and show some data when
>> the file is loaded into a DataFrame, but when we try to do some operations,
>> such as count, we get this error below.
>>
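>> For context, what we run is roughly the following (the bucket and path
>> are placeholders, not our real ones):
>>
>>     // Spark 1.6 on CDH 5.7.1, reading Parquet straight from S3 via s3a
>>     val df = sqlContext.read.parquet("s3a://some-bucket/path/file.parquet")
>>     df.printSchema()  // works: the schema comes back fine
>>     df.show(5)        // works: some rows are displayed
>>     df.count()        // fails on the executors with the error below
>>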
>> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS
>> credentials from any provider in the chain
>>         at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
>>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
>>         at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
>>         at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
>>         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
>>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
>>         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
>>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
>>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>>         at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
>>         at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
>>         at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
>>         at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
>>         at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>         at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> Can anyone help?
>>
>> Cheers,
>> Ben
>>
>>
>>
>>
>
>
