From my understanding, newAPIHadoopFile or hadoopFile is generic enough to support any InputFormat you want. IMO it is not necessary to add a new API for this.
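For example, a caller can already get the combined-split behavior with a couple of extra lines. A rough, untested sketch (the path and the 64 MB cap are placeholders):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

    // Cap each combined split (and therefore each partition) at 64 MB -- placeholder value.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

    // One task reads many small files, up to the split-size cap per partition.
    val lines = sc
      .newAPIHadoopFile("/path/to/text/dir", classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text], sc.hadoopConfiguration)
      .map(_._2.toString)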
On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:

> Spark users might not know about CombineTextInputFormat. They probably
> think that sc.textFile already implements the best way to read text files.
>
> I think CombineTextInputFormat can replace the regular TextInputFormat in
> most cases. Maybe Spark 2.0 could use CombineTextInputFormat in sc.textFile?
>
> On May 19, 2016 2:43 AM, "Reynold Xin" <r...@databricks.com> wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied, right? In general there are a lot of methods already on
>> SparkContext, and we lean toward the conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <apivova...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>>
>>> Do you think it would be useful to add a combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files, and lets you
>>> control the number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize.
>>>
>>> import org.apache.hadoop.io.{LongWritable, Text}
>>> import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
>>> import org.apache.spark.SparkContext
>>> import org.apache.spark.rdd.RDD
>>>
>>> def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>   val conf = sc.hadoopConfiguration
>>>   sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>       classOf[LongWritable], classOf[Text], conf)
>>>     .map(pair => pair._2.toString)
>>>     .setName(path)
>>> }
>>>
>>> Alex
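For comparison, the Dataset/DataFrame route Reynold mentioned would look roughly like this (a sketch assuming the Spark 2.0 SparkSession API; the path is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("combined-read").getOrCreate()
    // Spark SQL's file sources pack many small files into each partition
    // automatically, without the caller picking an InputFormat.
    val lines = spark.read.textFile("/path/to/text/dir")  // Dataset[String]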