Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16137#discussion_r91855254
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -956,24 +976,24 @@ class SparkContext(config: SparkConf) extends Logging {
       }
     
       /**
    -   * Get an RDD for a Hadoop-readable dataset from a Hadoop JobConf given its InputFormat and other
    -   * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
    -   * using the older MapReduce API (`org.apache.hadoop.mapred`).
    +   * Get an RDD for a Hadoop-readable dataset from a Hadoop `JobConf` given its `InputFormat`
    +   * and other necessary info (e.g. file name for a filesystem-based dataset, table name
    +   * for HyperTable), using the older MapReduce API (`org.apache.hadoop.mapred`).
        *
    -   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
    +   * @note Because Hadoop's `RecordReader` class re-uses the same Writable object for each
    +   * record, directly caching the returned RDD or directly passing it to an aggregation
    +   * or shuffle operation will create many references to the same object.
    +   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you
    +   * should first copy them using a `map` function.
    +   * @param conf `JobConf` for setting up the dataset. Note: This will be put into a Broadcast.
        *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
        *             sure you won't modify the conf. A safe approach is always creating a new conf for
        *             a new RDD.
        * @param inputFormatClass Class of the InputFormat
        * @param keyClass Class of the keys
        * @param valueClass Class of the values
    -   * @param minPartitions Minimum number of Hadoop Splits to generate.
    -   *
    --- End diff --
    
    As before, don't move this please
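
    For readers of the `@note` above: a minimal sketch of the `map`-copy it recommends. The input path, app name, and object name below are placeholders, not from this PR.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    object WritableCopyExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("writable-copy").setMaster("local[*]"))

        // hadoopFile uses the old mapred API; the RecordReader re-uses one
        // Writable instance per partition, so the records it yields all point
        // at the same mutable object. (Path below is a placeholder.)
        val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat](
          "hdfs://path/to/input")

        // Copy each record into immutable Scala values before caching,
        // sorting, or aggregating; without this, cached/shuffled entries
        // would all reference the last-read Writable.
        val copied = raw.map { case (offset, line) => (offset.get, line.toString) }
        copied.cache()

        sc.stop()
      }
    }
    ```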


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
