Many Spark users might not know about CombineTextInputFormat; they probably assume that sc.textFile is already the best way to read text files.
I think CombineTextInputFormat can replace the regular TextInputFormat in most cases. Maybe Spark 2.0 could use CombineTextInputFormat in sc.textFile?

On May 19, 2016 2:43 AM, "Reynold Xin" <r...@databricks.com> wrote:

> Users would be able to run this already with the 3 lines of code you
> supplied, right? In general there are a lot of methods already on
> SparkContext, and we lean towards the more conservative side in
> introducing new API variants.
>
> Note that this is something we are doing automatically in Spark SQL for
> file sources (Dataset/DataFrame).
>
> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov
> <apivova...@gmail.com> wrote:
>
>> Hello Everyone
>>
>> Do you think it would be useful to add a combinedTextFile method (which
>> uses CombineTextInputFormat) to SparkContext?
>>
>> It allows one task to read data from multiple text files and to control
>> the number of RDD partitions by setting
>> mapreduce.input.fileinputformat.split.maxsize
>>
>> import org.apache.hadoop.io.{LongWritable, Text}
>> import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
>> import org.apache.spark.SparkContext
>> import org.apache.spark.rdd.RDD
>>
>> def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>   val conf = sc.hadoopConfiguration
>>   sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>     classOf[LongWritable], classOf[Text], conf)
>>     .map(pair => pair._2.toString)
>>     .setName(path)
>> }
>>
>> Alex
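For anyone who wants to try this before any API change lands, here is a minimal self-contained sketch of the same approach. The input path and the 64 MB split cap are placeholder values, not from the thread; CombineTextInputFormat is the new-API class from org.apache.hadoop.mapreduce.lib.input.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombinedTextFileDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("combined-text-file-demo").setMaster("local[*]"))

    // Cap each combined split at 64 MB (placeholder value).
    // CombineTextInputFormat packs many small files into one split until
    // the cap is reached, so the resulting RDD has fewer, larger
    // partitions than sc.textFile would produce.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)

    val lines = sc.newAPIHadoopFile(
        "/data/many-small-files/*.txt",  // placeholder path
        classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text],
        sc.hadoopConfiguration)
      .map(_._2.toString)

    println(s"partitions = ${lines.getNumPartitions}")
    sc.stop()
  }
}

Note that if the max split size is left unset, CombineTextInputFormat combines all blocks on a node into a single split, so setting mapreduce.input.fileinputformat.split.maxsize is effectively how you choose the partition count.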