Many Spark users might not know about CombineTextInputFormat; they probably assume that sc.textFile is already the best way to read text files.
I think CombineTextInputFormat can replace the regular TextInputFormat in most cases. Maybe Spark 2.0 could use CombineTextInputFormat in sc.textFile?

On May 19, 2016 2:43 AM, "Reynold Xin" <r...@databricks.com> wrote:

> Users would be able to run this already with the 3 lines of code you
> supplied, right? In general there are a lot of methods already on
> SparkContext, and we lean towards the more conservative side in
> introducing new API variants.
>
> Note that this is something we are doing automatically in Spark SQL for
> file sources (Dataset/DataFrame).
>
> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov
> <apivova...@gmail.com> wrote:
>
>> Hello Everyone
>>
>> Do you think it would be useful to add a combinedTextFile method (which
>> uses CombineTextInputFormat) to SparkContext?
>>
>> It allows one task to read data from multiple text files and to control
>> the number of RDD partitions by setting
>> mapreduce.input.fileinputformat.split.maxsize
>>
>> import org.apache.hadoop.io.{LongWritable, Text}
>> import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
>> import org.apache.spark.SparkContext
>> import org.apache.spark.rdd.RDD
>>
>> def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>   val conf = sc.hadoopConfiguration
>>   sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>     classOf[LongWritable], classOf[Text], conf)
>>     .map(pair => pair._2.toString)
>>     .setName(path)
>> }
>>
>> Alex
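For anyone who wants to try this before any API change lands, here is a minimal self-contained sketch of the same approach. The input path and the 64 MB split cap are placeholder values, not from the thread; CombineTextInputFormat is the new-API class from org.apache.hadoop.mapreduce.lib.input.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombinedTextFileDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("combined-text-file-demo").setMaster("local[*]"))

    // Cap each combined split at 64 MB (placeholder value).
    // CombineTextInputFormat packs many small files into one split until
    // the cap is reached, so the resulting RDD has fewer, larger
    // partitions than sc.textFile would produce.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

    val lines = sc.newAPIHadoopFile(
        "/data/many-small-files/*.txt",  // placeholder path
        classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text],
        sc.hadoopConfiguration)
      .map(_._2.toString)

    println(s"partitions = ${lines.getNumPartitions}")
    sc.stop()
  }
}

Note that if the max split size is left unset, CombineTextInputFormat combines all blocks on a node into a single split, so setting mapreduce.input.fileinputformat.split.maxsize is effectively how you choose the partition count.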