[GitHub] [beam] iemejia commented on pull request #12603: [WIP][BEAM-10670] Make SparkRunner opt-out for using an SDF powered Read transform.

GitBox Wed, 23 Sep 2020 14:46:56 -0700


iemejia commented on pull request #12603:
URL: https://github.com/apache/beam/pull/12603#issuecomment-697989304

The phenomenon of microbatches producing results early I noticed it too in
the past when trying to enable the Read.Unbounded tests. I could not understand
why, and I thought it was probably due to some glitch in Spark implementation
or us screwing their scheduling but I struggled to debug the issue properly.

> Since watermark holds don't seem to be implemented, does the
GroupAlsoViaWindowSet hold back the watermark for elements that it currently
has buffered?

Probably, at least that may explain some of the inconsistencies.

> Can you explain how the GlobalWatermarkHolder works, can I register
anything as a sourceId?

In all honesty I am not so familiar with watermark handling on the Spark
runner. I took a look at the GlobalWatermarkHolder class and tried to figure
out but it was not really evident.

My impression is that the sourceId is aligned somehow with Spark's assigned
streamId, but I might be misinterpreting it.

https://github.com/apache/spark/blob/13664434387e338a5029e73a4388943f34e3fc07/streaming/src/main/scala/org/apache/spark/streaming/scheduler/InputInfoTracker.scala#L30

I wish I could help more but that part of the code is also not so well
documented. I doubt that the original authors of the code still remember the
details but maybe they remember at least the intentions of
`GlobalWatermarkHolder` and its use, and maybe if there were any open issues.
Just in case :crossed_fingers: maybe: @amitsela @aviemzur @staslev

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] iemejia commented on pull request #12603: [WIP][BEAM-10670] Make SparkRunner opt-out for using an SDF powered Read transform.

Reply via email to