[jira] [Work logged] (BEAM-10670) Make non-portable Splittable DoFn the only option when executing Java "Read" transforms

ASF GitHub Bot (Jira) Thu, 01 Oct 2020 15:05:26 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-10670?focusedWorklogId=493711&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-493711
 ]


ASF GitHub Bot logged work on BEAM-10670:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Oct/20 22:04
            Start Date: 01/Oct/20 22:04
    Worklog Time Spent: 10m 
      Work Description: lukecwik commented on pull request #12603:
URL: https://github.com/apache/beam/pull/12603#issuecomment-702422068


   > The phenomenon of microbatches producing results early I noticed it too in 
the past when trying to enable the Read.Unbounded tests. I could not understand 
why, and I thought it was probably due to some glitch in Spark implementation 
or us screwing their scheduling but I struggled to debug the issue properly.
   > 
   > > Since watermark holds don't seem to be implemented, does the 
GroupAlsoViaWindowSet hold back the watermark for elements that it currently 
has buffered?
   > 
   > Probably, at least that may explain some of the inconsistencies. 
   
   The Java based trigger implementation relies on this to produce correct 
results. Implementing this would like enable a bunch of streaming use cases.
   
   > > Can you explain how the GlobalWatermarkHolder works, can I register 
anything as a sourceId?
   > 
   > In all honesty I am not so familiar with watermark handling on the Spark 
runner. I took a look at the GlobalWatermarkHolder class and tried to figure 
out but it was not really evident.
   > 
   > My impression is that the sourceId is aligned somehow with Spark's 
assigned streamId, but I might be misinterpreting it.
   > 
https://github.com/apache/spark/blob/13664434387e338a5029e73a4388943f34e3fc07/streaming/src/main/scala/org/apache/spark/streaming/scheduler/InputInfoTracker.scala#L30
   > 
   > I wish I could help more but that part of the code is also not so well 
documented. I doubt that the original authors of the code still remember the 
details but maybe they remember at least the intentions of 
`GlobalWatermarkHolder` and its use, and maybe if there were any open issues. 
Just in case 🤞 maybe: @amitsela @aviemzur @staslev
   
   That would be great if someone could give guidance here.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 493711)
    Time Spent: 19h 40m  (was: 19.5h)

> Make non-portable Splittable DoFn the only option when executing Java "Read" 
> transforms
> ---------------------------------------------------------------------------------------
>
>                 Key: BEAM-10670
>                 URL: https://issues.apache.org/jira/browse/BEAM-10670
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-core
>            Reporter: Luke Cwik
>            Assignee: Luke Cwik
>            Priority: P2
>          Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> All runners seem to be capable of migrating to splittable DoFn for 
> non-portable execution except for Dataflow runner v1 which will internalize 
> the current primitive read implementation that is shared across runner 
> implementations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (BEAM-10670) Make non-portable Splittable DoFn the only option when executing Java "Read" transforms

Reply via email to