Re: How to parallelize zip file processing?
Thanks for your reply. I receive the dataset from a mainframe system that I have no control over. I tried the following to move the data to other executors, without success:

1. Called the repartition method. The data was repartitioned, but all partitions stayed on the same executor, and only one core is processing them.
2. Read the zip files into an RDD, saved it to the S3 file system, and re-read it as a distributable file. In this scenario the data is also loaded onto one executor, and one core is processing it.

Any suggestions for moving this data to other executors?

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: How to parallelize zip file processing?
Does the zip file contain only one file? If so, I fear you can only use one core. By the way, do you mean gzip? In that case you cannot decompress it in parallel.

How is the zip file created? Can’t you create several of them?

> On 10. Aug 2018, at 22:54, mytramesh wrote:
>
> I know Spark doesn’t support zip files directly since they are not distributable.
> Are there any techniques to process such a file quickly?
>
> I am trying to process a zip file of around 4GB. All the data moves to one executor,
> and only one task is assigned to process it.
>
> Even when I call the repartition method, the data is partitioned but stays on the
> same executor.
>
> How can I distribute the data to other executors?
> How can I get more tasks/threads assigned when the data is partitioned on the same
> executor?
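
To illustrate the point in the reply about how many files the archive contains: each entry in a zip archive is compressed independently, so the entries can be handled as separate tasks, while a single entry has to be read as one sequential stream. The sketch below is plain Python using only the standard library, not Spark code. The helper names (`build_archive`, `count_lines`, `count_lines_per_entry`) and the tiny in-memory archive are made up for illustration.

```python
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor


def build_archive(members):
    """Build an in-memory zip archive from a {name: text} mapping (demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, text in members.items():
            zf.writestr(name, text)
    return buf.getvalue()


def count_lines(archive_bytes, entry_name):
    """Decompress one entry and count its lines; each call opens its own reader."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        with zf.open(entry_name) as fh:
            return entry_name, sum(1 for _ in fh)


def count_lines_per_entry(archive_bytes, max_workers=4):
    """Run one task per entry; the number of entries caps the useful parallelism."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        names = [info.filename for info in zf.infolist() if not info.is_dir()]
    workers = max(1, min(max_workers, len(names)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda name: count_lines(archive_bytes, name), names))


# Hypothetical two-entry archive: two independent units of work.
archive = build_archive({
    "part-0.txt": "a\nb\nc\n",
    "part-1.txt": "d\ne\n",
})
counts = count_lines_per_entry(archive)
print(counts)
```

With a single-entry archive the same function ends up with exactly one task, which matches the one-core behaviour described in this thread. Producing several smaller archives, or one archive with several entries, gives a scheduler more than one unit of work to hand out.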
How to parallelize zip file processing?
I know Spark doesn’t support zip files directly since they are not distributable. Are there any techniques to process such a file quickly?

I am trying to process a zip file of around 4GB. All the data moves to one executor, and only one task is assigned to process it.

Even when I call the repartition method, the data is partitioned but stays on the same executor.

How can I distribute the data to other executors?
How can I get more tasks/threads assigned when the data is partitioned on the same executor?