Jack Hu created SPARK-11749:
-------------------------------

             Summary: Duplicate creation of RDDs in file stream when recovering from checkpoint data
                 Key: SPARK-11749
                 URL: https://issues.apache.org/jira/browse/SPARK-11749
             Project: Spark
          Issue Type: Bug
          Components: Streaming
    Affects Versions: 1.5.0, 1.5.2
            Reporter: Jack Hu


I have a use case that monitors an HDFS folder, enriches the incoming data with several reference tables (about 15), and, after some operations, writes the results to different Hive tables.

The code is as this:
{code}
val txt = ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates)

val refTable1 = ssc.textFileStream(refSource1).map(parse(_)).updateStateByKey(...)
txt.join(refTable1).map(...).reduceByKey(...).foreachRDD(
  rdd => {
    // insert into hive table
  }
)

val refTable2 = ssc.textFileStream(refSource2).map(parse(_)).updateStateByKey(...)
txt.join(refTable2).map(...).reduceByKey(...).foreachRDD(
  rdd => {
    // insert into hive table
  }
)

// more refTables in following code
{code}
 
The {{batchInterval}} of this application is set to *30 seconds*, the checkpoint interval is set to *10 minutes*, and every batch in {{txt}} contains *60 files*.
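For reference, the intervals above would be configured roughly as below. This is only a sketch: the SparkConf, the checkpoint directory path, and the {{folder}} variable are placeholders, not values from the application.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamEnrichment")  // hypothetical app name
val ssc = new StreamingContext(conf, Seconds(30))              // batchInterval = 30 seconds
ssc.checkpoint("/path/to/checkpointDir")                       // placeholder checkpoint dir

val txt = ssc.textFileStream(folder).map(toKeyValuePair).reduceByKey(removeDuplicates)
txt.checkpoint(Minutes(10))                                    // checkpoint interval = 10 minutes
```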

After recovering from the checkpoint data, I see many log entries for creating the RDDs in the file stream: the RDD for each batch of the file stream was recreated *15 times*, and creating that many file RDDs took about *5 minutes*. During this period, *10K+ broadcasts* were created and used almost all of the block manager space.

After some investigation, we found that {{DStream.restoreCheckpointData}} is invoked once per output operation ({{DStream.foreachRDD}} in this case), and there is no flag to indicate that a {{DStream}} has already been restored, so the RDDs in the file stream were recreated each time.

Suggest adding a flag to the restore process to avoid this duplicated work.
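The suggested fix can be sketched in plain Scala as a guard flag. This is not the actual Spark internal code; the class and field names here ({{CheckpointedStream}}, {{restoredFromCheckpointData}}) are hypothetical, and incrementing a counter stands in for the expensive work of recreating the file RDDs:

```scala
// Sketch of the proposed fix: each registered output operation walks the
// DStream graph and calls restoreCheckpointData on its ancestors, so a
// file stream shared by 15 outputs is restored 15 times unless a flag
// short-circuits the repeated calls.
class CheckpointedStream {
  private var restoredFromCheckpointData = false // the suggested flag
  var restoreCount = 0                           // stands in for recreating file RDDs

  def restoreCheckpointData(): Unit = {
    if (!restoredFromCheckpointData) {           // skip the duplicated work
      restoreCount += 1
      restoredFromCheckpointData = true
    }
  }
}

object RestoreDemo {
  def main(args: Array[String]): Unit = {
    val shared = new CheckpointedStream
    // 15 output operations all sharing the same upstream file stream
    (1 to 15).foreach(_ => shared.restoreCheckpointData())
    println(shared.restoreCount) // prints 1: restored only once
  }
}
```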




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
