On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <[email protected]> wrote:
> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <[email protected]> wrote:
>
>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <[email protected]> wrote:
>>
>>> Reviving this thread. I think SDF is a pretty big risk for Spark runner
>>> streaming. Holden, is it correct that Spark appears to have no way at all
>>> to produce an infinite DStream from a finite RDD? Maybe we can somehow
>>> dynamically create a new DStream for every initial restriction, said
>>> DStream being obtained using a Receiver that under the hood actually runs
>>> the SDF? (This is of course less efficient than what a timer-capable
>>> runner would do, and I have doubts about the fault tolerance.)
>>
>> So on the streaming side we could simply do it with a fixed number of
>> levels of DStreams. It's not great, but it would work.
>
> Not sure I understand this. Let me try to clarify what SDF demands of the
> runner. Imagine the following case: a file contains a list of "master"
> Kafka topics, on which additional Kafka topics to read are published.
>
>     PCollection<String> masterTopics = TextIO.read().from(masterTopicsFile)
>     PCollection<String> nestedTopics = masterTopics.apply(ParDo(ReadFromKafkaFn))
>     PCollection<String> records = nestedTopics.apply(ParDo(ReadFromKafkaFn))
>
> This exemplifies both use cases of a streaming SDF that emits infinite
> output for every input:
> - Applying it to a finite set of inputs (in this case, to the result of
>   reading a text file)
> - Applying it to an infinite set of inputs (i.e. having an unbounded
>   number of streams being read concurrently, each of those streams itself
>   unbounded too)
>
> Does the multi-level solution you have in mind work for this case? I
> suppose the second case is harder, so we can focus on that.

So none of those are a SplittableDoFn, right? Assuming that we have a given
DStream, though, in Spark we can get the underlying RDD implementation for
each microbatch and do our work inside of that.
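The second, harder case (an unbounded set of streams, each itself unbounded) can be made concrete with a microbatch-style driver loop, sketched here in plain Java with no Spark or Beam dependency. Everything in it is a hypothetical stand-in: `readChunk` plays the role of a bounded Kafka poll, topic names and chunk sizes are made up, and each loop iteration stands in for one microbatch in which every known stream advances by a bounded amount and reading the master stream can discover new streams.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedStreamSim {

    // Hypothetical bounded read: pull up to `max` records from a fake
    // infinite stream, starting at `offset`. Never reports "end of stream".
    static List<String> readChunk(String topic, long offset, int max) {
        List<String> out = new ArrayList<>();
        for (long i = offset; i < offset + max; i++) {
            out.add(topic + ":" + i);
        }
        return out;
    }

    // Returns {number of live streams, number of records read} after
    // `batches` simulated microbatches.
    static long[] simulate(int batches) {
        // Nested topics the master stream will announce, one per batch.
        Deque<String> toAnnounce = new ArrayDeque<>(List.of("nestedA", "nestedB", "nestedC"));
        Map<String, Long> offsets = new HashMap<>(); // one restriction per live stream
        offsets.put("master", 0L);
        List<String> records = new ArrayList<>();

        for (int batch = 0; batch < batches; batch++) {
            // Snapshot the stream set: streams discovered mid-batch start next batch.
            for (String topic : new ArrayList<>(offsets.keySet())) {
                List<String> chunk = readChunk(topic, offsets.get(topic), 2);
                offsets.merge(topic, (long) chunk.size(), Long::sum);
                if (topic.equals("master")) {
                    // Reading the master stream discovers a new unbounded stream.
                    if (!toAnnounce.isEmpty()) {
                        offsets.putIfAbsent(toAnnounce.poll(), 0L);
                    }
                } else {
                    records.addAll(chunk);
                }
            }
        }
        return new long[] { offsets.size(), records.size() };
    }

    public static void main(String[] args) {
        long[] stats = simulate(3);
        System.out.println(stats[0] + " live streams, " + stats[1] + " records read");
    }
}
```

What the toy loop makes visible is that the set of restrictions grows while the job runs, so a fixed number of DStream levels bounds the nesting depth of the example (two here) but each level still has to carry an unbounded number of concurrent streams.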
>> More generally, this does raise an important question if we want to target
>> Datasets instead of RDDs/DStreams, in which case I would need to do some
>> more poking.
>>
>>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <[email protected]> wrote:
>>>
>>>> How would timers be implemented? By outputting and reprocessing, the
>>>> same way you proposed for SDF?
>>
>> I mean the timers could be inside the mappers within the system. Could
>> use a singleton, so if a partition is re-executed it doesn't end up as a
>> straggler.
>>
>>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <[email protected]> wrote:
>>>>
>>>>> So the timers would have to be in our own code.
>>>>>
>>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <[email protected]> wrote:
>>>>>
>>>>>> Does Spark have support for timers? (I know it has support for state.)
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <[email protected]> wrote:
>>>>>>
>>>>>>> Could we alternatively use a state mapping function to keep track of
>>>>>>> the computation so far, instead of outputting V each time? (Also, the
>>>>>>> progress so far is probably of a different type R rather than V.)
>>>>>>>
>>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <[email protected]> wrote:
>>>>>>>
>>>>>>>> So we had a quick chat about what it would take to add something
>>>>>>>> like SplittableDoFns to Spark. I'd done some sketchy thinking about
>>>>>>>> this last year but didn't get very far.
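Reuven's state-mapping idea, keeping progress of a separate type R in state and only emitting V if a piece of work actually completes, can be sketched without any Spark API. This is a plain-Java stand-in, not Spark's `mapWithState`: `Progress` plays the role of R, the restriction is just a numeric range, and `step` is one microbatch's worth of work.

```java
public class StatefulProgressSim {

    // The progress type R: position in the restriction plus the partial result.
    static final class Progress {
        final long nextOffset;
        final long partialSum;
        Progress(long nextOffset, long partialSum) {
            this.nextOffset = nextOffset;
            this.partialSum = partialSum;
        }
    }

    // One step of the state mapping function: (R, bounded work) -> R'.
    // Nothing is output here; the partial result stays inside R.
    static Progress step(Progress r, long end, long chunk) {
        long hi = Math.min(r.nextOffset + chunk, end);
        long sum = r.partialSum;
        for (long i = r.nextOffset; i < hi; i++) {
            sum += i;
        }
        return new Progress(hi, sum);
    }

    // Drive steps until the restriction [0, end) is exhausted, then emit
    // the single final output V.
    static long runToCompletion(long end, long chunk) {
        Progress r = new Progress(0, 0);
        while (r.nextOffset < end) {
            r = step(r, end, chunk);
        }
        return r.partialSum;
    }

    public static void main(String[] args) {
        // Sum of 0..9, computed three elements per simulated microbatch.
        System.out.println(runToCompletion(10, 3));
    }
}
```

For an unbounded restriction the loop in `runToCompletion` never exits, so in the streaming case R would have to be checkpointed each step rather than drained to a final V.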
>>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>>>
>>>>>>>> For input type T and output type V, implement a mapper which
>>>>>>>> outputs type (T, V); if the computation finishes, T will be
>>>>>>>> populated, otherwise V will be.
>>>>>>>>
>>>>>>>> For determining how long to run, we'd go up to either K seconds or
>>>>>>>> listen for a signal on a port.
>>>>>>>>
>>>>>>>> Once we're done running, we take the result, filter the ones with
>>>>>>>> T and with V into separate collections, re-run until finished, and
>>>>>>>> then union the results.
>>>>>>>>
>>>>>>>> This is maybe not a great design, but it was minimally complicated,
>>>>>>>> and I figured terrible was a good place to start and improve from.
>>>>>>>>
>>>>>>>> Let me know your thoughts, especially the parts where this is worse
>>>>>>>> than I remember, because it's been a while since I thought about this.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau

--
Twitter: https://twitter.com/holdenkarau
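The back-of-the-envelope run/filter/re-run/union loop can be written down as a self-contained simulation. This is plain Java with no Spark dependency, and it assumes one reading of the (T, V) convention, namely that the unfinished side carries a residual input T to re-run and the finished side carries an output V; the per-round work budget is a stand-in for "run up to K seconds", and all names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RerunUnionSim {

    // A (T, V) pair where exactly one side is populated: the residual
    // input T if the bounded run did not finish, or the output V if it did.
    static final class Either {
        final Integer residual; // T: remaining work, if unfinished
        final String output;    // V: result, if finished
        Either(Integer t, String v) { residual = t; output = v; }
    }

    // One bounded run over a single work item: do at most `budget` units
    // of work, standing in for "run up to K seconds".
    static Either runBounded(String name, int remaining, int budget) {
        return remaining <= budget
            ? new Either(null, name + ":done")
            : new Either(remaining - budget, null);
    }

    // Filter finished vs. unfinished into separate collections, re-run the
    // unfinished side until it drains, then union all the results.
    static List<String> driveToCompletion(Map<String, Integer> work, int budget) {
        List<String> finished = new ArrayList<>();
        Map<String, Integer> pending = new TreeMap<>(work);
        while (!pending.isEmpty()) {
            Map<String, Integer> next = new TreeMap<>();
            for (Map.Entry<String, Integer> e : pending.entrySet()) {
                Either r = runBounded(e.getKey(), e.getValue(), budget);
                if (r.output != null) {
                    finished.add(r.output);        // V side: done
                } else {
                    next.put(e.getKey(), r.residual); // T side: re-run next round
                }
            }
            pending = next;
        }
        return finished;
    }

    public static void main(String[] args) {
        // Three work items needing 2, 5, and 9 units, with 4 units per round.
        System.out.println(driveToCompletion(Map.of("a", 2, "b", 5, "c", 9), 4));
    }
}
```

Note the loop only terminates because every item here is finite: for a truly unbounded per-element read, the pending side never drains, which is exactly the difficulty this thread keeps circling for the Spark runner.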
