[ https://issues.apache.org/jira/browse/SPARK-9072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628479#comment-14628479 ]
Murtaza Kanchwala commented on SPARK-9072:
------------------------------------------
Some relevant threads I found while searching:
http://search-hadoop.com/m/q3RTtzU2FI1Mo1QA&subj=Spark+will+process+_temporary+folder+on+S3+is+very+slow+and+always+cause+failure
http://stackoverflow.com/questions/26291165/spark-sql-unable-to-complete-writing-parquet-data-with-a-large-number-of-shards?lq=1
http://stackoverflow.com/questions/26332542/saving-a-25t-schemardd-in-parquet-format-on-s3
https://forums.databricks.com/questions/1097/stall-on-loading-many-parquet-files-on-s3.html
> Parquet : Writing data to S3 very slowly
> ----------------------------------------
>
> Key: SPARK-9072
> URL: https://issues.apache.org/jira/browse/SPARK-9072
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Murtaza Kanchwala
> Priority: Critical
> Labels: parquet
> Fix For: 1.5.0
>
>
> I've written Spark programs that convert plain text files to Parquet and CSV
> and write the output to S3.
> There is around 8 TB of data, and I need to compress it to a smaller size for
> further processing on Amazon EMR.
> Results:
> 1) Text -> CSV took 1.2 hrs to transform 8 TB of data and wrote it to S3
> successfully, without any problems.
> 2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but
> after job completion it is still spilling/writing the data separately to S3,
> which makes it slower and leaves it in starvation.
> Input : s3n://<SameBucket>/input
> Output : s3n://<SameBucket>/output/parquet
> Let's say I have around 10K files; it then takes about 20 min per 1000 files
> to write them back to S3.
> Note:
> I also found that the program creates a temp folder in the S3 output location,
> and in the logs I've seen S3ReadDelays.
> Can anyone tell me what I am doing wrong? Or is there anything I need to add
> so that the Spark app doesn't create a temp folder on S3 and writes quickly
> from EMR to S3, just like saveAsTextFile? Thanks
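
A rough back-of-the-envelope sketch of the write-back rate quoted in the
description (about 1000 files per 20 minutes, for roughly 10K files). It
assumes the rate stays constant for the whole run, which the report does not
confirm, and the function name is hypothetical and only for illustration.

```python
def write_back_minutes(total_files, files_per_batch=1000, minutes_per_batch=20):
    """Estimate the extra write-back time, assuming a constant rate."""
    return total_files / files_per_batch * minutes_per_batch


# Figures taken from the issue description: ~10K files, 1000 files / 20 min.
extra_minutes = write_back_minutes(10_000)
job_hours = 1.2  # reported job time before the write-back phase

print(f"write-back: {extra_minutes:.0f} min ({extra_minutes / 60:.2f} hrs)")
print(f"total:      {job_hours + extra_minutes / 60:.2f} hrs")
```

Under that assumption, the write-back phase alone would take longer than the
1.2 hr job itself.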
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)