Re: spark df.write.partitionBy run very slow

2019-03-11 Thread JF Chen
Hi, finally I found the reason... It was caused by long GC pauses on some datanodes. After receiving the data from executors, a datanode stuck in a long GC cannot report its blocks to the namenode, so the write takes a long time to make progress. Now I have decommissioned the broken datanodes, and now my spark runs

read json and write into parquet in executors

2019-03-11 Thread Lian Jiang
Hi, in my spark batch job: step 1, the driver assigns a partition of the json file path list to each executor. Step 2, each executor fetches its assigned json files from S3 and saves them into HDFS. Step 3, the driver reads these json files into a data frame and saves it as parquet. To improve performance by
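Step 1 of the job above can be illustrated with a small sketch. This is plain Python, not Spark API: the helper name `assign_partitions` and the bucket paths are hypothetical, and a real job would let Spark partition the path list (e.g. via an RDD of paths) rather than doing it by hand.

```python
# Sketch (assumption, not Spark API): how a driver might split a list of
# S3 JSON paths into one partition per executor, as in step 1 above.

def assign_partitions(paths, num_executors):
    """Round-robin the path list into num_executors buckets."""
    buckets = [[] for _ in range(num_executors)]
    for i, path in enumerate(paths):
        buckets[i % num_executors].append(path)
    return buckets

# Hypothetical input paths for illustration only.
paths = [f"s3://bucket/data/file{i}.json" for i in range(7)]
parts = assign_partitions(paths, 3)
# Each executor would then fetch its bucket from S3 and write to HDFS.
```

Round-robin keeps the buckets within one element of each other in size, which roughly balances work across executors when the files are similarly sized.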

unsubscribe

2019-03-11 Thread Byron Lee

returning type of function that needs to be passed to method 'mapWithState'

2019-03-11 Thread shicheng31...@gmail.com
Hi all: the `mapWithState` method in spark streaming requires you to pass in an anonymous function. This function maintains a state and should also return a result, yet the final stateful result can already be obtained from the state object. So, what is the significance of