Re: How to parallelize zip file processing?

2018-08-13 Thread mytramesh


Thanks for your reply. The dataset comes from a mainframe system that I don't
have control over.

I tried the following to move the data to other executors, but without success:

  1. Called the repartition method. The data was re-partitioned, but all the
partitions stayed on the same executor, and only one core is processing them.

  2. After reading the zip file into an RDD, I saved it to the S3 file system
and re-read it from there, hoping it would come back splittable. In this
scenario too, the data is loaded onto one executor and processed by a single core.

  Any suggestions for moving this data to other executors?
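For reference, roughly what I am doing at the moment (the s3a path, app name and
parallelism factor below are placeholders, not my real values): read the archive
with binaryFiles, decompress it inside a flatMap on whichever executor holds it,
then call repartition hoping the decompressed records spread across the cluster.

import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.sql.SparkSession

object ZipToRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("zip-to-rdd").getOrCreate()
    val sc = spark.sparkContext

    // binaryFiles yields one record per file, so a single zip lands in one partition
    val archive = sc.binaryFiles("s3a://some-bucket/input/big.zip")

    // Decompress on the one executor holding the archive, emitting one record per text line
    val lines = archive.flatMap { case (_, stream) =>
      val zis = new ZipInputStream(stream.open())
      Iterator.continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap(_ => Source.fromInputStream(zis).getLines())
    }

    // Force a shuffle so the decompressed lines move to other executors;
    // everything downstream of this should run with full parallelism
    val spread = lines.repartition(sc.defaultParallelism * 3)

    println(spread.count())
    spark.stop()
  }
}

My understanding is that the unzip step itself has to stay on one core (the
archive isn't splittable), but that after the repartition the work should be
spread out. Is there anything else I should check, for example whether all
executors are actually registered for the application?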



 






Re: How to parallelize zip file processing?

2018-08-10 Thread Jörn Franke
Does the zip file contain only one file? I fear that in that case you can only
use one core.

By the way, do you mean gzip? In that case you cannot decompress it in
parallel...

How is the zip file created? Can’t you create several of them instead of one?
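Something like this is what I have in mind (the path is just an example, and it
assumes an existing SparkContext sc): if the producer writes several smaller
archives instead of one, binaryFiles gives you roughly one partition per
archive, so several cores can decompress at the same time.

// With a directory of smaller zip files, binaryFiles produces (roughly) one
// partition per archive, so each one can be decompressed by a different core.
// Very small files may still be grouped into the same partition.
val archives = sc.binaryFiles("s3a://some-bucket/input/zips/*.zip")
println(s"archives = ${archives.count()}, partitions = ${archives.getNumPartitions}")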

> On 10. Aug 2018, at 22:54, mytramesh  wrote:
> 
> I know Spark doesn’t support zip files directly, since they are not splittable.
> Are there any techniques to process this file quickly?
> 
> I am trying to process a roughly 4 GB zip file. All the data ends up on one
> executor, and only one task is assigned to process it.
> 
> Even when I call the repartition method, the data gets partitioned, but on
> the same executor.
> 
> 
> How can I distribute the data to other executors?
> How can I get more tasks/threads assigned when the data is partitioned on
> the same executor?
> 
> 
> 
> 




How to parallelize zip file processing?

2018-08-10 Thread mytramesh
I know Spark doesn’t support zip files directly, since they are not splittable.
Are there any techniques to process this file quickly?

I am trying to process a roughly 4 GB zip file. All the data ends up on one
executor, and only one task is assigned to process it.

Even when I call the repartition method, the data gets partitioned, but on the
same executor.


How can I distribute the data to other executors?
How can I get more tasks/threads assigned when the data is partitioned on the
same executor?
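If it helps, this is the kind of check I can run to see how the data is laid
out after the repartition (rdd and sc here stand for my actual RDD and
SparkContext):

// Count records per partition and compare with the scheduler's default parallelism.
// The Stages tab of the Spark UI then shows which executor each of these tasks ran on.
val sizes = rdd
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
sizes.foreach { case (idx, n) => println(s"partition $idx -> $n records") }
println(s"defaultParallelism = ${sc.defaultParallelism}")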



