Tathagata, Thanks for stating your preference for Approach 2.
My use case and motivation are similar to the concerns raised by others in
SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series, Spark
Streaming applications could process the files that already existed in an
input directory before the streaming application began, and in some projects
we did for our customers we relied on that feature. Starting from the 1.2.x
series, we are limited in this respect to files whose timestamps are no older
than 1 minute; the only workaround is to 'touch' those files before starting
a streaming application. Moreover, this MIN_REMEMBER_DURATION is set to an
arbitrary value of 1 minute, and I don't see any argument why it cannot be
set to another arbitrary value (keeping the default value of 1 minute if
nothing is set by the user).

Putting all this together, my plan is to create a pull request that does the
following:

1- Convert "private val MIN_REMEMBER_DURATION" into "private val
minRememberDuration" (to reflect that it is no longer a constant, in the
sense that it can be set via configuration)

2- Set its value by using something like
getConf("spark.streaming.minRememberDuration", Minutes(1))

3- Document spark.streaming.minRememberDuration in the Spark Streaming
Programming Guide

If the above sounds fine, then I'll go on implementing this small change and
submit a pull request fixing SPARK-3276. What do you say?

Kind regards,
Emre Sevinç
http://www.bigindustries.be/


On Wed, Apr 8, 2015 at 7:16 PM, Tathagata Das <t...@databricks.com> wrote:

> Approach 2 is definitely better :)
> Can you tell us more about the use case why you want to do this?
>
> TD
>
> On Wed, Apr 8, 2015 at 1:44 AM, Emre Sevinc <emre.sev...@gmail.com> wrote:
>
>> Hello,
>>
>> This is about SPARK-3276 and I want to make MIN_REMEMBER_DURATION (that
>> is now a constant) a variable (configurable, with a default value).
>> Before spending effort on developing something and creating a pull
>> request, I wanted to consult with the core developers to see which
>> approach makes the most sense and has the higher probability of being
>> accepted.
>>
>> The constant MIN_REMEMBER_DURATION can be seen at:
>>
>> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338
>>
>> It is marked as a private member of the private[streaming] object
>> FileInputDStream.
>>
>> Approach 1: Make MIN_REMEMBER_DURATION a variable, with the new name
>> minRememberDuration, and then add a new fileStream method to
>> JavaStreamingContext.scala:
>>
>> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
>>
>> such that the new fileStream method accepts a new parameter, e.g.
>> minRememberDuration: Int (in seconds), and then uses this value to set
>> the private minRememberDuration.
>>
>> Approach 2: Create a new, public Spark configuration property, e.g. named
>> spark.rememberDuration.min (with a default value of 60 seconds), and then
>> set the private variable minRememberDuration to the value of this Spark
>> property.
>>
>> Approach 1 would mean adding a new method to the public API; Approach 2
>> would mean creating a new public Spark property. Right now, Approach 2
>> seems more straightforward and simpler to me, but nevertheless I wanted
>> to have the opinions of other developers who know the internals of Spark
>> better than I do.
>>
>> Kind regards,
>> Emre Sevinç
>>

--
Emre Sevinc
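
P.S. A minimal sketch of how Approach 2 could look, assuming the proposed
key name spark.streaming.minRememberDuration and a 60-second default (both
subject to review). A plain Map stands in for SparkConf here so the example
runs without a Spark dependency; it is not the actual Spark patch:

```scala
// Sketch of the configurable minimum remember duration (Approach 2).
// A plain Map[String, String] stands in for SparkConf; the key name and
// 60-second default mirror the proposal, not any committed Spark API.
object MinRememberDurationSketch {
  // Hypothetical configuration key from the proposal.
  val Key = "spark.streaming.minRememberDuration"

  // Default preserves the current hard-coded 1-minute behaviour.
  val DefaultSeconds = 60L

  // Read the duration (in seconds) from the config, falling back to the
  // default when the user sets nothing.
  def minRememberDurationSeconds(conf: Map[String, String]): Long =
    conf.get(Key).map(_.toLong).getOrElse(DefaultSeconds)
}
```

With no setting, the existing 1-minute behaviour is kept; setting the key to
0 would let a streaming application pick up files that predate its start,
which is exactly the SPARK-3276 use case described above.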