[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

andrewor14 Thu, 10 Jul 2014 00:44:29 -0700

Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1351#discussion_r14752889
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
    @@ -130,6 +131,47 @@ class SQLContext(@transient val sparkContext: SparkContext)
         new SchemaRDD(this, JsonRDD.inferSchema(json, samplingRatio))
     
       /**
    +   * Loads a CSV file (according to RFC 4180) and returns the result as a [[SchemaRDD]].
    +   *
    +   * NOTE: If there are new line characters inside quoted fields this method may fail to
    +   * parse correctly, because the two lines may be in different partitions. Use
    +   * [[SQLContext#csvRDD]] to parse such files.
    +   *
    +   * @param path path to input file
    +   * @param delimiter Optional delimiter (default is comma)
    +   * @param quote Optional quote character or string (default is '"')
    +   * @param header Optional flag to indicate first line of each file is the header
    +   *               (default is false)
    +   */
    +  def csvFile(path: String,
    +      delimiter: String = ",",
    +      quote: String = "\"",
    +      header: Boolean = false): SchemaRDD = {
    +    val csv = sparkContext.textFile(path)
    +    csvRDD(csv, delimiter, quote, header)
    +  }
    +
    +  /**
    +   * Parses an RDD of String as a CSV (according to RFC 4180) and returns the result as a
    +   * [[SchemaRDD]].
    +   *
    +   * NOTE: If there are new line characters inside quoted fields, use
    +   * [[SparkContext#wholeTextFiles]] to read each file into a single partition.
    +   *
    +   * @param csv input RDD
    +   * @param delimiter Optional delimiter (default is comma)
    +   * @param quote Optional quote character or string (default is '"')
    +   * @param header Optional flag to indicate first line of each file is the header
    +   *               (default is false)
    +   */
    +  def csvRDD(csv: RDD[String],
    +      delimiter: String = ",",
    +      quote: String = "\"",
    --- End diff --
    
    Hi Hossein, small nit: could you format these as follows:
    ```
    def csvRDD(
        csv: RDD[String],
        ...
        header: Boolean = false): SchemaRDD = {
      new SchemaRDD(...)
    }
    ```
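
    For illustration only, here is a self-contained sketch of that layout
    (first parameter on its own line, parameters indented four spaces, body
    indented two). Everything in it is a hypothetical stand-in, not the code
    from the pull request: `csvRows` is an invented name, `Seq[String]`
    replaces `RDD[String]`, and `Seq[Seq[String]]` replaces `SchemaRDD` so
    that it runs without Spark. The split is deliberately naive and does not
    implement RFC 4180 quoting.

    ```scala
    import java.util.regex.Pattern

    object CsvStyleSketch {
      // Hypothetical stand-in for csvRDD, written in the suggested layout.
      def csvRows(
          csv: Seq[String],
          delimiter: String = ",",
          quote: String = "\"",
          header: Boolean = false): Seq[Seq[String]] = {
        // Skip the first line when it is flagged as a header.
        val body = if (header) csv.drop(1) else csv
        // Naive split: treats the delimiter literally and ignores quoting rules,
        // so delimiters or newlines inside quoted fields are NOT handled.
        body.map { line =>
          val fields = line.split(Pattern.quote(delimiter), -1).toSeq
          // Strip one leading and one trailing quote from each field, if present.
          fields.map(_.stripPrefix(quote).stripSuffix(quote))
        }
      }
    }
    ```

    A call such as `CsvStyleSketch.csvRows(lines, header = true)` then relies
    on the defaults for `delimiter` and `quote`, the same way the defaults in
    the diff are meant to be used.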



