On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <[email protected]> wrote:
> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <[email protected]> wrote:
>
>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <[email protected]> wrote:
>>
>>> Reviving this thread. I think SDF is a pretty big risk for Spark runner
>>> streaming. Holden, is it correct that Spark appears to have no way at all
>>> to produce an infinite DStream from a finite RDD? Maybe we can somehow
>>> dynamically create a new DStream for every initial restriction, said
>>> DStream being obtained using a Receiver that under the hood actually runs
>>> the SDF? (This is of course less efficient than what a timer-capable
>>> runner would do, and I have doubts about the fault tolerance.)
>>
>> So on the streaming side we could simply do it with a fixed number of
>> levels of DStreams. It's not great, but it would work.
>
> Not sure I understand this. Let me try to clarify what SDF demands of the
> runner. Imagine the following case: a file contains a list of "master"
> Kafka topics, on which additional Kafka topics to read are published.
>
>     PCollection<String> masterTopics = TextIO.read().from(masterTopicsFile)
>     PCollection<String> nestedTopics = masterTopics.apply(ParDo(ReadFromKafkaFn))
>     PCollection<String> records = nestedTopics.apply(ParDo(ReadFromKafkaFn))
>
> This exemplifies both use cases of a streaming SDF that emits infinite
> output for every input:
> - Applying it to a finite set of inputs (in this case, to the result of
>   reading a text file)
> - Applying it to an infinite set of inputs (i.e. having an unbounded
>   number of streams being read concurrently, each of those streams itself
>   unbounded too)
>
> Does the multi-level solution you have in mind work for this case? I
> suppose the second case is harder, so we can focus on that.

So none of those are a SplittableDoFn, right? Assuming that we have a given
DStream, though, in Spark we can get the underlying RDD implementation for
each microbatch and do our work inside of that.
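The second, harder case (an unbounded set of streams, each itself unbounded) can be made concrete with a microbatch-style driver loop, sketched here in plain Java with no Spark or Beam dependency. Everything in it is a hypothetical stand-in: `readChunk` plays the role of a bounded Kafka poll, topic names and chunk sizes are made up, and each loop iteration stands in for one microbatch in which every known stream advances by a bounded amount and reading the master stream can discover new streams.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedStreamSim {

    // Hypothetical bounded read: pull up to `max` records from a fake
    // infinite stream, starting at `offset`. Never reports "end of stream".
    static List<String> readChunk(String topic, long offset, int max) {
        List<String> out = new ArrayList<>();
        for (long i = offset; i < offset + max; i++) {
            out.add(topic + ":" + i);
        }
        return out;
    }

    // Returns {number of live streams, number of records read} after
    // `batches` simulated microbatches.
    static long[] simulate(int batches) {
        // Nested topics the master stream will announce, one per batch.
        Deque<String> toAnnounce = new ArrayDeque<>(List.of("nestedA", "nestedB", "nestedC"));
        Map<String, Long> offsets = new HashMap<>(); // one restriction per live stream
        offsets.put("master", 0L);
        List<String> records = new ArrayList<>();

        for (int batch = 0; batch < batches; batch++) {
            // Snapshot the stream set: streams discovered mid-batch start next batch.
            for (String topic : new ArrayList<>(offsets.keySet())) {
                List<String> chunk = readChunk(topic, offsets.get(topic), 2);
                offsets.merge(topic, (long) chunk.size(), Long::sum);
                if (topic.equals("master")) {
                    // Reading the master stream discovers a new unbounded stream.
                    if (!toAnnounce.isEmpty()) {
                        offsets.putIfAbsent(toAnnounce.poll(), 0L);
                    }
                } else {
                    records.addAll(chunk);
                }
            }
        }
        return new long[] { offsets.size(), records.size() };
    }

    public static void main(String[] args) {
        long[] stats = simulate(3);
        System.out.println(stats[0] + " live streams, " + stats[1] + " records read");
    }
}
```

What the toy loop makes visible is that the set of restrictions grows while the job runs, so a fixed number of DStream levels bounds the nesting depth of the example (two here) but each level still has to carry an unbounded number of concurrent streams.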
>> More generally, this does raise an important question if we want to target
>> Datasets instead of RDDs/DStreams, in which case I would need to do some
>> more poking.
>>
>>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <[email protected]> wrote:
>>>
>>>> How would timers be implemented? By outputting and reprocessing, the
>>>> same way you proposed for SDF?
>>
>> I mean the timers could be inside the mappers within the system. Could
>> use a singleton, so if a partition is re-executed it doesn't end up as a
>> straggler.
>>
>>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <[email protected]> wrote:
>>>>
>>>>> So the timers would have to be in our own code.
>>>>>
>>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <[email protected]> wrote:
>>>>>
>>>>>> Does Spark have support for timers? (I know it has support for state.)
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <[email protected]> wrote:
>>>>>>
>>>>>>> Could we alternatively use a state mapping function to keep track of
>>>>>>> the computation so far, instead of outputting V each time? (Also, the
>>>>>>> progress so far is probably of a different type R rather than V.)
>>>>>>>
>>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>>> So we had a quick chat about what it would take to add something
>>>>>>>> like SplittableDoFns to Spark. I'd done some sketchy thinking about
>>>>>>>> this last year but didn't get very far.
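Reuven's state-mapping idea, keeping progress of a separate type R in state and only emitting V if a piece of work actually completes, can be sketched without any Spark API. This is a plain-Java stand-in, not Spark's `mapWithState`: `Progress` plays the role of R, the restriction is just a numeric range, and `step` is one microbatch's worth of work.

```java
public class StatefulProgressSim {

    // The progress type R: position in the restriction plus the partial result.
    static final class Progress {
        final long nextOffset;
        final long partialSum;
        Progress(long nextOffset, long partialSum) {
            this.nextOffset = nextOffset;
            this.partialSum = partialSum;
        }
    }

    // One step of the state mapping function: (R, bounded work) -> R'.
    // Nothing is output here; the partial result stays inside R.
    static Progress step(Progress r, long end, long chunk) {
        long hi = Math.min(r.nextOffset + chunk, end);
        long sum = r.partialSum;
        for (long i = r.nextOffset; i < hi; i++) {
            sum += i;
        }
        return new Progress(hi, sum);
    }

    // Drive steps until the restriction [0, end) is exhausted, then emit
    // the single final output V.
    static long runToCompletion(long end, long chunk) {
        Progress r = new Progress(0, 0);
        while (r.nextOffset < end) {
            r = step(r, end, chunk);
        }
        return r.partialSum;
    }

    public static void main(String[] args) {
        // Sum of 0..9, computed three elements per simulated microbatch.
        System.out.println(runToCompletion(10, 3));
    }
}
```

For an unbounded restriction the loop in `runToCompletion` never exits, so in the streaming case R would have to be checkpointed each step rather than drained to a final V.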
>>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>>>
>>>>>>>> For input type T and output type V, implement a mapper which
>>>>>>>> outputs type (T, V); if the computation finishes, T will be
>>>>>>>> populated, otherwise V will be.
>>>>>>>>
>>>>>>>> For determining how long to run, we'd go up to either K seconds or
>>>>>>>> listen for a signal on a port.
>>>>>>>>
>>>>>>>> Once we're done running, we take the result, filter the ones with
>>>>>>>> T and with V into separate collections, re-run until finished, and
>>>>>>>> then union the results.
>>>>>>>>
>>>>>>>> This is maybe not a great design, but it was minimally complicated,
>>>>>>>> and I figured terrible was a good place to start and improve from.
>>>>>>>>
>>>>>>>> Let me know your thoughts, especially the parts where this is worse
>>>>>>>> than I remember, because it's been a while since I thought about this.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau

--
Twitter: https://twitter.com/holdenkarau
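The back-of-the-envelope run/filter/re-run/union loop can be written down as a self-contained simulation. This is plain Java with no Spark dependency, and it assumes one reading of the (T, V) convention, namely that the unfinished side carries a residual input T to re-run and the finished side carries an output V; the per-round work budget is a stand-in for "run up to K seconds", and all names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RerunUnionSim {

    // A (T, V) pair where exactly one side is populated: the residual
    // input T if the bounded run did not finish, or the output V if it did.
    static final class Either {
        final Integer residual; // T: remaining work, if unfinished
        final String output;    // V: result, if finished
        Either(Integer t, String v) { residual = t; output = v; }
    }

    // One bounded run over a single work item: do at most `budget` units
    // of work, standing in for "run up to K seconds".
    static Either runBounded(String name, int remaining, int budget) {
        return remaining <= budget
            ? new Either(null, name + ":done")
            : new Either(remaining - budget, null);
    }

    // Filter finished vs. unfinished into separate collections, re-run the
    // unfinished side until it drains, then union all the results.
    static List<String> driveToCompletion(Map<String, Integer> work, int budget) {
        List<String> finished = new ArrayList<>();
        Map<String, Integer> pending = new TreeMap<>(work);
        while (!pending.isEmpty()) {
            Map<String, Integer> next = new TreeMap<>();
            for (Map.Entry<String, Integer> e : pending.entrySet()) {
                Either r = runBounded(e.getKey(), e.getValue(), budget);
                if (r.output != null) {
                    finished.add(r.output);        // V side: done
                } else {
                    next.put(e.getKey(), r.residual); // T side: re-run next round
                }
            }
            pending = next;
        }
        return finished;
    }

    public static void main(String[] args) {
        // Three work items needing 2, 5, and 9 units, with 4 units per round.
        System.out.println(driveToCompletion(Map.of("a", 2, "b", 5, "c", 9), 4));
    }
}
```

Note the loop only terminates because every item here is finite: for a truly unbounded per-element read, the pending side never drains, which is exactly the difficulty this thread keeps circling for the Spark runner.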
