Hello everyone,

Do you think it would be useful to add a combinedTextFile method (which uses
CombineTextInputFormat) to SparkContext?

It lets a single task read data from multiple text files, and lets you
control the number of RDD partitions by setting
mapreduce.input.fileinputformat.split.maxsize in the Hadoop configuration.


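  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
    val conf = sc.hadoopConfiguration
    sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      .map(pair => pair._2.toString) // keep the line text, drop the offset key
      .setName(path)
  }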
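
To show how the split size setting ties in, here is a rough usage sketch. It
assumes the combinedTextFile helper above is in scope; the local master, the
128 MB cap, and the input path are placeholders I picked for illustration,
not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master and app name chosen only for this example.
val sc = new SparkContext(
  new SparkConf().setAppName("combined-text-file-demo").setMaster("local[*]"))

// Cap each combined split at 128 MB (illustrative value). The helper passes
// sc.hadoopConfiguration to the input format, so this setting is picked up.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

// Hypothetical directory containing many small text files.
val lines = combinedTextFile(sc)("/path/to/many/small/files")

// Each partition corresponds to one combined split.
println(s"partitions: ${lines.partitions.length}")

sc.stop()
```

A smaller maxsize should give more partitions, a larger one fewer, which is
the knob the proposal is meant to expose.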


Alex
