Not exactly the same as the one you suggested, but you can chain it with
flatMap to get what you want, if each file is not huge.
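
For example, roughly (the path here is just a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("demo"))
  // wholeTextFiles yields (path, content) pairs with each file held fully
  // in memory (hence the "not huge" caveat); flatMap splits them into lines.
  val lines = sc.wholeTextFiles("hdfs:///data/*.txt")
    .flatMap { case (_, content) => content.split("\n") }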

On Thu, May 19, 2016, 8:41 AM Xiangrui Meng <men...@gmail.com> wrote:

> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin <r...@databricks.com> wrote:
>
>> Users can already do this with the 3 lines of code you supplied, right?
>> In general, there are already a lot of methods on SparkContext, and we
>> lean towards the more conservative side in introducing new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
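>>
>> For example, with a SQLContext in scope this is just (the path is a
>> placeholder):
>>
>>   // Spark SQL packs file splits into partitions automatically
>>   val df = sqlContext.read.text("hdfs:///data/*.txt")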
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add a combinedTextFile method
>>> (which uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and lets you
>>> control the number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize.
>>>
>>>
>>>   import org.apache.hadoop.io.{LongWritable, Text}
>>>   import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
>>>   import org.apache.spark.SparkContext
>>>   import org.apache.spark.rdd.RDD
>>>
>>>   // CombineTextInputFormat packs many small files into each split.
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>>     val conf = sc.hadoopConfiguration
>>>     sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>>       classOf[LongWritable], classOf[Text], conf)
>>>       .map(pair => pair._2.toString).setName(path)
>>>   }
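>>>
>>> For example (the 256 MB cap and path below are just illustrative):
>>>
>>>   // Bound the combined input per split: roughly one partition per 256 MB
>>>   sc.hadoopConfiguration.set(
>>>     "mapreduce.input.fileinputformat.split.maxsize",
>>>     (256L * 1024 * 1024).toString)
>>>   val lines = combinedTextFile(sc)("s3://bucket/many-small-files/*")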
>>>
>>>
>>> Alex
>>>
