[ https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619100#comment-14619100 ]
Matt Massie commented on SPARK-7263:
------------------------------------
The Spark shuffle manager APIs, in their current state, don't support a
standalone shuffle implementation. If you like, I can split my pull request
into two parts: (a) changes to Spark, e.g. [serializing class
info|https://github.com/massie/spark/commit/fc03c0bd29fa71ff390b86a8f6fd31c1cbef960f],
making APIs public, etc., and (b) the new Parquet implementation.
I think your comment that "we're creating a whole new shuffle subsystem for one
data type" is technically correct, but it misses the bigger point. The currently
supported data type, {{IndexedRecord}}, is the base type for all Avro objects
and provides three methods -- {{get}}, {{put}} and {{getSchema}} -- the
primitives necessary for describing, storing and building objects. Since
Parquet also supports Thrift and Protobuf, it would be straightforward to add
their base types, which serve similar functions, here as well.
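For illustration, here is a minimal sketch (a hypothetical helper, not code from the pull request) of how those three primitives alone are enough to describe, build and populate a record generically:
{code}
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, IndexedRecord}

// Hypothetical helper: copy any Avro record using only the
// IndexedRecord primitives -- getSchema, get and put.
object IndexedRecordCopy {
  def copy(record: IndexedRecord): IndexedRecord = {
    val schema: Schema = record.getSchema        // describe the record
    val dest = new GenericData.Record(schema)    // build an empty record
    var i = 0
    while (i < schema.getFields.size()) {
      dest.put(i, record.get(i))                 // store each field by position
      i += 1
    }
    dest
  }
}
{code}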
I reached out to Michael Armbrust and looked at the Spark SQL code in depth
before I wrote this. I had hoped to piggyback on the Spark SQL work but found
that it wasn't a good match. If you like, I can list all the issues that I
found.
I'd also like to understand why you think this would be a maintenance nightmare.
I think otherwise, but of course I wrote it. Can you be more specific about your
maintenance concerns?
> Add new shuffle manager which stores shuffle blocks in Parquet
> --------------------------------------------------------------
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
> Issue Type: New Feature
> Components: Block Manager
> Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
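> For example, a minimal sketch (assuming only the configuration key above) of enabling it:
> {code}
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("parquet-shuffle-example")      // hypothetical app name
>   .set("spark.shuffle.manager", "parquet")    // selects the Parquet shuffle manager
> val sc = new SparkContext(conf)
> {code}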
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to
> shuffle blocks in a record-oriented fashion. This shuffle manager addresses
> this issue by reading and writing all shuffle blocks in the Parquet format.
> If the shuffle objects are Avro records, the Avro schema is converted to a
> Parquet schema and used directly; otherwise, the Parquet schema is generated
> via reflection.
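> As a rough sketch of the Avro path (assuming parquet-avro's {{AvroSchemaConverter}}; this is an illustration, not the code in the prototype):
> {code}
> import org.apache.avro.generic.IndexedRecord
> import org.apache.parquet.avro.AvroSchemaConverter
> import org.apache.parquet.schema.MessageType
>
> // Derive a Parquet schema directly from an Avro record's own schema.
> def parquetSchemaFor(record: IndexedRecord): MessageType =
>   new AvroSchemaConverter().convert(record.getSchema)
> {code}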
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options (a usage sketch
> follows the list):
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - sets the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata
> sections are scattered throughout a Parquet file, making a streaming API
> difficult. As such, the ShuffleBlockFetcherIterator has been modified to fetch
> the entire contents of map outputs into temporary blocks before loading the
> data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires one)
> o Parquet supports predicate pushdown and projection, which could be used
> between shuffle stages to improve performance in the future
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)