Hello, I am currently extending the Hadoop InputSplit and RecordReader classes with custom implementations to pass to SparkContext's hadoopRDD() function.
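For reference, this is roughly the shape of my setup, stripped down to a minimal sketch; the class names, the input path, and the LongWritable/Text key and value types below are placeholders rather than my actual code:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, FileSplit, InputSplit, JobConf, RecordReader, Reporter}
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder input format: hands out one MyRecordReader per split.
class MyInputFormat extends FileInputFormat[LongWritable, Text] {
  override def getRecordReader(split: InputSplit, job: JobConf, reporter: Reporter)
      : RecordReader[LongWritable, Text] =
    new MyRecordReader(split.asInstanceOf[FileSplit])
}

// Placeholder record reader; getProgress() is the method my question is about.
class MyRecordReader(split: FileSplit) extends RecordReader[LongWritable, Text] {
  private var pos = 0L
  override def createKey(): LongWritable = new LongWritable()
  override def createValue(): Text = new Text()
  override def getPos(): Long = pos
  override def getProgress(): Float =
    if (split.getLength == 0) 1.0f
    else math.min(1.0f, pos.toFloat / split.getLength)
  override def next(key: LongWritable, value: Text): Boolean = {
    // read the next record from the split, advance pos; return false at end of split
    false
  }
  override def close(): Unit = {}
}

object Driver {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("custom-input").setMaster("local[2]"))
    val conf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(conf, "/path/to/input")  // placeholder path
    val rdd = sc.hadoopRDD(conf, classOf[MyInputFormat], classOf[LongWritable], classOf[Text])
    println(rdd.map { case (_, v) => v.toString.length }.count())
    sc.stop()
  }
}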
My question is the following: does the value returned by InputSplit.getLength() and/or RecordReader.getProgress() affect the execution of a map() function in the Spark runtime?

I am asking because I have used these two custom classes on Hadoop and they do not cause any problems. In Spark, however, I see that new InputSplit objects are generated at runtime. To be more precise: at the beginning, I see in my log file that an InputSplit object is created and the RecordReader associated with it starts fetching records. At some point, the job that is handling the previous InputSplit stops, and a new one is spawned with a new InputSplit. I do not understand why this is happening. Any help would be appreciated.

Thank you,
Nick

P.S. 1: I am sorry for posting my question on the developer mailing list, but I could not find anything similar on the users list. Also, I really need to understand the Spark runtime, and I believe that on the developer list my question will be read by Spark contributors.

P.S. 2: I can provide more technical details if they are needed.