Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/16137#discussion_r91855251
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -956,24 +976,24 @@ class SparkContext(config: SparkConf) extends Logging {
   }
   /**
-   * Get an RDD for a Hadoop-readable dataset from a Hadoop JobConf given its InputFormat and other
-   * necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable),
-   * using the older MapReduce API (`org.apache.hadoop.mapred`).
+   * Get an RDD for a Hadoop-readable dataset from a Hadoop `JobConf` given its `InputFormat`
+   * and other necessary info (e.g. file name for a filesystem-based dataset, table name
+   * for HyperTable), using the older MapReduce API (`org.apache.hadoop.mapred`).
    *
-   * @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
+   * @note Because Hadoop's `RecordReader` class re-uses the same Writable object for each
+   * record, directly caching the returned RDD or directly passing it to an aggregation
+   * or shuffle operation will create many references to the same object.
+   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you
+   * should first copy them using a `map` function.
+   * @param conf `JobConf` for setting up the dataset. Note: This will be put into a Broadcast.
    *             Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
    *             sure you won't modify the conf. A safe approach is always creating a new conf for
    *             a new RDD.
    * @param inputFormatClass Class of the InputFormat
    * @param keyClass Class of the keys
    * @param valueClass Class of the values
-   * @param minPartitions Minimum number of Hadoop Splits to generate.
-   *
-   * @note Because Hadoop's RecordReader class re-uses the same Writable object for each
-   * record, directly caching the returned RDD or directly passing it to an aggregation or shuffle
-   * operation will create many references to the same object.
-   * If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first
-   * copy them using a `map` function.
+   * @param minPartitions minimum number of Hadoop Splits to generate.
--- End diff --
Say "partitions" not "splits" despite what the existing string says