tdas commented on a change in pull request #26162: [SPARK-29438][SS] Use 
partition ID of StateStoreAwareZipPartitionsRDD for determining partition ID of 
state store in stream-stream join
URL: https://github.com/apache/spark/pull/26162#discussion_r368338952
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala
 ##########
 @@ -137,7 +140,7 @@ class MemoryStreamScanBuilder(stream: MemoryStreamBase[_]) extends ScanBuilder w
  * is intended for use in unit tests as it can only replay data when the object is still
  * available.
  */
-case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext)
+case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext, numPartitions: Int = 1)
 
 Review comment:
   I am a little concerned about changing the default behavior of this class. Previously, the default behavior was to put each "new block" into its own partition (i.e., each new seq of data added through `addData` formed a partition). So in a unit test, if multiple `AddData` actions were made before starting the stream, the first batch could have multiple partitions. With this change, there will always be exactly one partition. I am not sure whether any unit test requires the old behavior to be correct, but I am a little afraid of an unpredictable impact on other tests. So I would make this parameter an `Option` defaulting to `None` (that is, no fixed number of partitions) and preserve the existing behavior. A rough sketch of what I mean is below.
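
   To make the suggestion concrete, here is a minimal, self-contained sketch of the idea. The names `ToyMemoryStream` and `planPartitions` are illustrative only and are not Spark's actual `MemoryStream` internals: with `numPartitions = None`, each `addData` block stays its own partition as before; with `Some(n)`, the buffered records are redistributed over exactly `n` partitions.

   ```scala
   import scala.collection.mutable.ArrayBuffer

   // Hypothetical, simplified model of the suggested API change (not Spark code).
   class ToyMemoryStream[A](numPartitions: Option[Int] = None) {
     // Each call to addData buffers one "block" of records.
     private val blocks = ArrayBuffer.empty[Seq[A]]

     def addData(data: Seq[A]): Unit = synchronized { blocks += data }

     // Plan the partitions for the next batch.
     def planPartitions(): Seq[Seq[A]] = synchronized {
       numPartitions match {
         // Default (legacy) behavior: every buffered block is its own partition.
         case None => blocks.toSeq
         // Fixed partitioning: spread all buffered records round-robin over n partitions.
         case Some(n) =>
           val all = blocks.flatten.toIndexedSeq
           (0 until n).map(i => all.zipWithIndex.collect { case (x, j) if j % n == i => x })
       }
     }
   }

   // Usage: two addData calls before the first batch still yield two partitions by default.
   val s = new ToyMemoryStream[Int]()
   s.addData(Seq(1, 2)); s.addData(Seq(3))
   assert(s.planPartitions() == Seq(Seq(1, 2), Seq(3)))
   ```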

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
