Github user arunmahadevan commented on a diff in the pull request:
https://github.com/apache/spark/pull/21385#discussion_r190320670
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/shuffle/UnsafeRowReceiver.scala ---
@@ -56,20 +69,71 @@ private[shuffle] class UnsafeRowReceiver(
   override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
     case r: UnsafeRowReceiverMessage =>
-      queue.put(r)
+      queues(r.writerId).put(r)
       context.reply(())
   }

   override def read(): Iterator[UnsafeRow] = {
     new NextIterator[UnsafeRow] {
-      override def getNext(): UnsafeRow = queue.take() match {
-        case ReceiverRow(r) => r
-        case ReceiverEpochMarker() =>
-          finished = true
-          null
+      // An array of flags for whether each writer ID has gotten an epoch marker.
+      private val writerEpochMarkersReceived = Array.fill(numShuffleWriters)(false)
+
+      private val executor = Executors.newFixedThreadPool(numShuffleWriters)
--- End diff ---
Are we creating a number of threads equal to the number of writers here, and do we
need them? I assume there is already an RDD task thread per partition invoking
`getNext`; can't we return the row from the right queue(s) there? My concern is that
these threads will bloat the total thread count to m*n (RDD partitions * shuffle
writers), which may not scale well given the amount of context switching. A sketch of
the single-threaded alternative follows.
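To make the suggestion concrete, here is a minimal sketch of multiplexing the
per-writer queues from the one task thread instead of spawning a thread per writer.
The `ReceiverMessage`/`ReceiverRow`/`ReceiverEpochMarker` types, the poll timeout,
and the round-robin order are hypothetical stand-ins for illustration, not the PR's
actual implementation:

```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

// Hypothetical stand-ins for the receiver's message types, for illustration only.
sealed trait ReceiverMessage[T]
case class ReceiverRow[T](row: T) extends ReceiverMessage[T]
case class ReceiverEpochMarker[T]() extends ReceiverMessage[T]

// A single-threaded reader that multiplexes over one queue per writer from the
// calling (task) thread, instead of spawning one thread per writer.
class SingleThreadedReader[T](queues: Array[ArrayBlockingQueue[ReceiverMessage[T]]])
    extends Iterator[T] {
  private val numWriters = queues.length
  private val markerReceived = Array.fill(numWriters)(false)
  private var nextWriter = 0
  private var buffered: Option[T] = None

  private def epochDone: Boolean = markerReceived.forall(identity)

  // Round-robin poll the queues of writers that have not yet sent an epoch
  // marker, until a row is buffered or every writer has finished the epoch.
  private def advance(): Unit = {
    while (buffered.isEmpty && !epochDone) {
      if (!markerReceived(nextWriter)) {
        queues(nextWriter).poll(10, TimeUnit.MILLISECONDS) match {
          case ReceiverRow(r)        => buffered = Some(r)
          case ReceiverEpochMarker() => markerReceived(nextWriter) = true
          case null                  => // this writer had nothing ready; move on
        }
      }
      nextWriter = (nextWriter + 1) % numWriters
    }
  }

  override def hasNext: Boolean = { advance(); buffered.isDefined }

  override def next(): T = {
    if (!hasNext) throw new NoSuchElementException("epoch exhausted")
    val r = buffered.get
    buffered = None
    r
  }
}
```

A usage sketch with two writers feeding Int rows:

```scala
val queues = Array.fill(2)(new ArrayBlockingQueue[ReceiverMessage[Int]](16))
queues(0).put(ReceiverRow(1)); queues(0).put(ReceiverEpochMarker[Int]())
queues(1).put(ReceiverRow(2)); queues(1).put(ReceiverEpochMarker[Int]())
new SingleThreadedReader(queues).foreach(println)  // prints 1 then 2
```

With this shape the thread count stays at one per task rather than m*n overall;
the trade-off is that an idle writer costs a poll timeout per round instead of
parking a dedicated blocked thread.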
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]