Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6423#discussion_r31363983
  
    --- Diff: core/src/main/scala/org/apache/spark/shuffle/hash/BlockStoreShuffleFetcher.scala ---
    @@ -17,23 +17,22 @@
     
     package org.apache.spark.shuffle.hash
     
    -import scala.collection.mutable.ArrayBuffer
    -import scala.collection.mutable.HashMap
    -import scala.util.{Failure, Success, Try}
    +import java.io.InputStream
     
     import org.apache.spark._
    -import org.apache.spark.serializer.Serializer
     import org.apache.spark.shuffle.FetchFailedException
     import org.apache.spark.storage.{BlockId, BlockManagerId, ShuffleBlockFetcherIterator, ShuffleBlockId}
     import org.apache.spark.util.CompletionIterator
     
    +import scala.collection.mutable.{ArrayBuffer, HashMap}
    +import scala.util.{Failure, Success, Try}
    +
     private[hash] object BlockStoreShuffleFetcher extends Logging {
    --- End diff ---
    
    This might be something of a sticking point, since I think it will be pretty hard for us to guarantee strong API stability for these internal interfaces. Maybe we can chat a bit more about requirements / use-cases here (or over email / JIRA) to see whether there's a way to address the typical reasons for wanting a customized shuffle without exposing large amounts of these internals.
    
    One gotcha is that the ShuffleManager is a SparkConf-wide setting, so it's a fairly coarse-grained extension point. Spark already supports customizing the serializer on a per-shuffle basis (see `ShuffledRDD` / `ShuffleDependency`), but I guess the problem for you might be that the interfaces there are record-at-a-time, whereas it sounds like you really want something more batch-oriented in order to do the Parquet compression tricks.
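
    For illustration, here's a minimal sketch of that existing per-shuffle serializer hook (assuming Spark 1.x-era APIs; `ShuffledRDD` is only a `@DeveloperApi`, and the app name and sample data here are made up):

    ```scala
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.ShuffledRDD
    import org.apache.spark.serializer.KryoSerializer

    val conf = new SparkConf().setAppName("per-shuffle-serializer").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical input: any RDD of key/value pairs.
    val pairs = sc.parallelize(1 to 100).map(i => (i % 8, i.toString))

    // The serializer applies to this one shuffle only; the rest of the job keeps
    // the SparkConf-wide default. Note that the serializer still sees a single
    // record at a time, which is the record-at-a-time limitation described above.
    val shuffled = new ShuffledRDD[Int, String, String](pairs, new HashPartitioner(8))
      .setSerializer(new KryoSerializer(conf))

    shuffled.collect()
    ```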
    
    I wonder whether we could implement new ShuffledRDD-like classes that expose stable APIs for block / batch-oriented shuffle writing and reading. This might be relevant for Project Tungsten / Spark SQL as well. Note that this would not eliminate the need for this patch's changes, since we'd still want to push the deserialization higher up the call stack.
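
    To make that idea concrete, here's a purely hypothetical shape for such an API (none of these traits exist in Spark; `BatchShuffleWriter` / `BatchShuffleReader` are invented names for illustration only):

    ```scala
    import java.io.{InputStream, OutputStream}

    import org.apache.spark.storage.BlockId

    // Hypothetical: a writer that receives a whole partition's worth of records
    // at once, so an implementation could columnarize and compress them as a
    // batch (e.g. the Parquet tricks mentioned above) instead of one at a time.
    trait BatchShuffleWriter[K, V] {
      def writeBatch(reducePartition: Int, records: Iterator[Product2[K, V]], out: OutputStream): Unit
    }

    // Hypothetical: a reader that hands back raw block streams rather than
    // deserialized records, pushing deserialization higher up the call stack,
    // which is what this patch's InputStream-based fetcher already enables.
    trait BatchShuffleReader {
      def readBlocks(): Iterator[(BlockId, InputStream)]
    }
    ```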

