Github user jose-torres commented on a diff in the pull request:
https://github.com/apache/spark/pull/20647#discussion_r170106104
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala ---
@@ -415,12 +418,14 @@ class MicroBatchExecution(
                case v1: SerializedOffset => reader.deserializeOffset(v1.json)
                case v2: OffsetV2 => v2
              }
-             reader.setOffsetRange(
-               toJava(current),
-               Optional.of(availableV2))
+             reader.setOffsetRange(toJava(current), Optional.of(availableV2))
              logDebug(s"Retrieving data from $reader: $current -> $availableV2")
-             Some(reader ->
-               new StreamingDataSourceV2Relation(reader.readSchema().toAttributes, reader))
+             Some(reader -> StreamingDataSourceV2Relation(
--- End diff ---
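For context on the hunk above: toJava(current) bridges a Scala
Option[OffsetV2] into the java.util.Optional that setOffsetRange expects.
A minimal sketch of such a bridge (illustrative only; the actual toJava
helper in MicroBatchExecution may be written differently):

    import java.util.Optional

    // Convert a Scala Option to a Java Optional; None maps to empty.
    def toJava[T](option: Option[T]): Optional[T] = option match {
      case Some(value) => Optional.of(value)
      case None => Optional.empty[T]
    }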
When it comes to the streaming execution code, the basic problem is that it
evolved more than it was designed. For example, there's no particular reason
to use a logical plan here; the map is only ever used to construct another
map of source -> physical plan stats. Untangling StreamExecution is
definitely something we need to do, but that's going to be annoying, and I
think it's sufficiently orthogonal to the V2 migration to put off.
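To make the "only used for stats" point concrete, here's the rough shape of
that collapse (a hypothetical sketch; Source, Plan, and Stats stand in for
the real streaming types, and this is not the actual StreamExecution code):

    // The source -> logical plan map is folded straight into a
    // source -> stats map; nothing downstream keeps the plans themselves.
    def toStats[Source, Plan, Stats](
        newData: Map[Source, Plan],
        statsOf: Plan => Stats): Map[Source, Stats] =
      newData.map { case (source, plan) => source -> statsOf(plan) }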
There's currently no design doc for the streaming aspects of DataSourceV2.
We kinda rushed an experimental version out the door, because it was coupled
with the experimental ContinuousExecution streaming mode. I'm working on going
back and cleaning things up; I'll send docs to the dev list and make sure to @
you on the changes.
---