tdas commented on a change in pull request #26162: [SPARK-29438][SS] Use partition ID of StateStoreAwareZipPartitionsRDD for determining partition ID of state store in stream-stream join URL: https://github.com/apache/spark/pull/26162#discussion_r368338952
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala
##########
@@ -137,7 +140,7 @@ class MemoryStreamScanBuilder(stream: MemoryStreamBase[_]) extends ScanBuilder w
  * is intended for use in unit tests as it can only replay data when the object is still
  * available.
  */
-case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext)
+case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext, numPartitions: Int = 1)

Review comment:
   I am a little concerned about changing the default behavior of this class. Previously, each "new block" of data became its own partition (i.e., each new sequence of data added through `addData` formed a partition), so in a unit test where multiple `AddData` actions were issued before the stream started, the first batch could have multiple partitions. With this change, there will always be exactly one partition. I am not sure whether any existing unit test depends on the old behavior to pass, but I am a little afraid of an unpredictable impact on other tests. So I would make this parameter an `Option[Int]` defaulting to `None` (that is, no fixed number of partitions) and preserve the existing behavior when it is not set.
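   To make the suggestion concrete, here is a rough standalone sketch of the fallback I have in mind. The object and function names are made up for illustration; this is not the actual `MemoryStream` code, just the `Option[Int]` fallback expressed as a plain function:

   ```scala
   // Sketch only: illustrates the suggested Option[Int] fallback, not the real MemoryStream internals.
   object MemoryStreamPartitioningSketch {

     /**
      * With numPartitions = None, keep the existing behavior: each block added via
      * addData stays its own partition. With Some(n), flatten the pending blocks
      * and re-slice them into exactly n partitions.
      */
     def planPartitions[A](blocks: Seq[Seq[A]], numPartitions: Option[Int]): Seq[Seq[A]] = {
       numPartitions match {
         case None => blocks
         case Some(n) =>
           val all = blocks.flatten
           if (all.isEmpty) {
             Seq.fill(n)(Seq.empty[A])
           } else {
             val sliceSize = math.max(1, math.ceil(all.size.toDouble / n).toInt)
             all.grouped(sliceSize).toSeq.padTo(n, Seq.empty[A])
           }
       }
     }

     def main(args: Array[String]): Unit = {
       val blocks = Seq(Seq(1, 2, 3), Seq(4, 5))   // two addData calls before the stream starts
       println(planPartitions(blocks, None))       // 2 partitions, one per addData block (current behavior)
       println(planPartitions(blocks, Some(3)))    // re-sliced into exactly 3 partitions
     }
   }
   ```

   With `numPartitions: Option[Int] = None` in the case class signature, tests that rely on the per-`addData`-block partitioning keep working unchanged, and tests that need a fixed partition count can opt in explicitly.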
