Murtaza Kanchwala created SPARK-9072:
----------------------------------------

             Summary: Parquet : Writing data to S3 very slowly
                 Key: SPARK-9072
                 URL: https://issues.apache.org/jira/browse/SPARK-9072
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
            Reporter: Murtaza Kanchwala
            Priority: Critical
             Fix For: 1.5.0


I've written Spark programs that convert plain text files to Parquet and CSV 
and write the output to S3.

There is around 8 TB of data, and I need to compress it to a smaller size for 
further processing on Amazon EMR.

Results:

1) Text -> CSV: transforming 8 TB of data and writing it to S3 took 1.2 hrs 
and completed without any problems.

2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but 
after job completion it is still writing the data separately to S3, which 
makes the whole run much slower and leaves it starved.

Input : s3n://<SameBucket>/input
Output : s3n://<SameBucket>/output/parquet

For example, with around 10K files, it writes them back to S3 at a rate of 
about 1000 files per 20 min.
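
As a rough back-of-the-envelope estimate from the figures above, and assuming 
the observed rate of 1000 files per 20 min stays constant (which has not been 
verified), the write phase after job completion works out as follows:

```python
# Figures are taken from the report; the constant rate is an assumption.
total_files = 10_000       # "around 10K files"
files_per_batch = 1_000    # observed: 1000 files ...
minutes_per_batch = 20     # ... per 20 minutes


def estimated_write_minutes(total_files, files_per_batch, minutes_per_batch):
    """Minutes needed to write all files at a constant observed rate."""
    return total_files / files_per_batch * minutes_per_batch


minutes = estimated_write_minutes(total_files, files_per_batch, minutes_per_batch)
print(f"{minutes:.0f} min (~{minutes / 60:.1f} hrs)")  # 200 min (~3.3 hrs)
```

That is roughly 200 min spent writing back to S3, on top of the 1.2 hrs the 
job itself took.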

Note: 
I also found that the program creates a temp folder in the S3 output location, 
and in the logs I've seen S3ReadDelays.
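
For context, a temp folder in the output location is consistent with the usual 
Hadoop-style output commit, where tasks first write under a temporary directory 
and the files are moved to their final location when the job commits. Whether 
that is what causes the slowdown here is not confirmed. With s3n a move is not 
a cheap rename but a copy followed by a delete, so such a phase costs time per 
file. The sketch below only simulates the pattern on the local filesystem; the 
function names and layout are hypothetical and are not Spark's or Hadoop's API.

```python
import os
import shutil
import tempfile


def write_task_output(output_dir, task_id, data):
    """Phase 1: each task writes its file under a temporary subfolder."""
    temp_dir = os.path.join(output_dir, "_temporary", f"task_{task_id}")
    os.makedirs(temp_dir, exist_ok=True)
    path = os.path.join(temp_dir, f"part-{task_id:05d}")
    with open(path, "w") as f:
        f.write(data)
    return path


def commit_job(output_dir):
    """Phase 2: after all tasks finish, move every file to its final place.

    On a local disk each move is a cheap rename. On an object store each one
    would be a copy plus a delete, so the cost grows with the file count.
    """
    temp_root = os.path.join(output_dir, "_temporary")
    moved = 0
    for task_dir in sorted(os.listdir(temp_root)):
        task_path = os.path.join(temp_root, task_dir)
        for name in sorted(os.listdir(task_path)):
            shutil.move(os.path.join(task_path, name),
                        os.path.join(output_dir, name))
            moved += 1
    shutil.rmtree(temp_root)  # remove the temporary folder once it is empty
    return moved


# Simulate three tasks, then the job-level commit.
output_dir = tempfile.mkdtemp()
for task_id in range(3):
    write_task_output(output_dir, task_id, f"rows from task {task_id}\n")
moved = commit_job(output_dir)
final_files = sorted(os.listdir(output_dir))
print(moved, final_files)
```

In this model the tasks can all be finished while the commit step is still 
working through the files one by one, which would match a job that reports 
completion but keeps writing to S3 afterwards.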

Can anyone tell me what I am doing wrong? Or is there anything I need to add so 
that the Spark app doesn't create the temp folder on S3 and writes quickly from 
EMR to S3, just like saveAsTextFile? Thanks.






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
