From my understanding, newAPIHadoopFile or hadoopFile is generic enough to
support any InputFormat you want. IMO it is not necessary to add a new API
for this.
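
For example, reading key-value text with the new-API
KeyValueTextInputFormat works today without any new SparkContext method
(just a sketch; the path is a placeholder):

  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

  // Any InputFormat can be plugged into newAPIHadoopFile directly.
  val kv = sc.newAPIHadoopFile("/path/to/data",
      classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text],
      sc.hadoopConfiguration)
    .map { case (k, v) => (k.toString, v.toString) }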

On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Spark users might not know about CombineTextInputFormat. They probably
> think that sc.textFile already implements the best way to read text files.
>
> I think CombineTextInputFormat can replace the regular TextInputFormat in
> most cases.
> Maybe Spark 2.0 could use CombineTextInputFormat in sc.textFile?
> On May 19, 2016 2:43 AM, "Reynold Xin" <r...@databricks.com> wrote:
>
>> Users can already run this with the 3 lines of code you supplied, right?
>> In general there are a lot of methods on SparkContext already, and we
>> lean towards the conservative side when introducing new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
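>>
>> For reference, a sketch of the DataFrame path. IIRC the packing of small
>> files into partitions is governed by spark.sql.files.maxPartitionBytes in
>> the 2.0 branch (treat the config name as an assumption; the path is a
>> placeholder):
>>
>>   // Spark SQL packs many small files into each partition automatically;
>>   // the target partition size is set via an assumed config key.
>>   spark.conf.set("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
>>   val df = spark.read.text("/path/to/many/small/files")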
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add a combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows a single task to read data from multiple text files, and the
>>> number of RDD partitions can be controlled by setting
>>> mapreduce.input.fileinputformat.split.maxsize.
>>>
>>>
>>>   import org.apache.hadoop.io.{LongWritable, Text}
>>>   import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
>>>
>>>   // One task reads many small files; the maximum split size comes from
>>>   // mapreduce.input.fileinputformat.split.maxsize in the Hadoop conf.
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>     val conf = sc.hadoopConfiguration
>>>     sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>         classOf[LongWritable], classOf[Text], conf)
>>>       .map(pair => pair._2.toString)
>>>       .setName(path)
>>>   }
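>>>
>>> For example, to aim for roughly 256 MB per partition (the path is a
>>> placeholder):
>>>
>>>   sc.hadoopConfiguration.setLong(
>>>     "mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
>>>   val lines = combinedTextFile(sc)("/data/many-small-files")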
>>>
>>>
>>> Alex
>>>
>>
>>
