[GitHub] spark pull request #21287: [SPARK-1849][Core]Add encoding customization supp...

HyukjinKwon Thu, 10 May 2018 01:57:07 -0700

Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21287#discussion_r187275950
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -816,14 +816,25 @@ class SparkContext(config: SparkConf) extends Logging 
{
        * Hadoop-supported file system URI, and return it as an RDD of Strings.
        * @param path path to the text file on a supported file system
        * @param minPartitions suggested minimum number of partitions for the 
resulting RDD
    +   * @param charsetName name of the charset to be used to decode the bytes 
from the text file
        * @return RDD of lines of the text file
        */
       def textFile(
           path: String,
    -      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    +      minPartitions: Int = defaultMinPartitions,
    +      charsetName: String = null): RDD[String] = withScope {
         assertNotStopped()
    -    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
classOf[Text],
    -      minPartitions).map(pair => pair._2.toString).setName(path)
    +    val textRdd = hadoopFile(path, classOf[TextInputFormat], 
classOf[LongWritable], classOf[Text],
    +      minPartitions).map(pair => pair._2)
    +    val stringRdd = if (charsetName == null) {
    +      textRdd.map(text => text.toString)
    +    } else {
    +      // Text.getBytes returns the internal memory buffer which size 
usually larger than the actual
    +      // bytes count of the string content for performance reasons. We 
should be careful to limit
    +      // the index range correctly within [0, Text.getLength)
    +      textRdd.map(text => new String(text.getBytes, 0, text.getLength, 
charsetName))
    --- End diff --
    
    Yea but I meant to have it in data source like text data source. 
`textinputformat.record.delimiter` should be manually set and it's a bit odd 
BTW. SparkContext is super old API and I wonder if we'd extend the support for 
this usually. I would just focus on fixing it in text datasource side.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21287: [SPARK-1849][Core]Add encoding customization supp...

Reply via email to