Sure. May I ask for a sample input (it could be just a few lines) and the output you are expecting? That would bring clarity to my thoughts.
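In the meantime, here is a rough sketch of what I have in mind. It assumes `sc` is an existing SparkContext (e.g. in spark-shell), each file contains one JSON object per line, and the input glob and output path below are placeholders, not your real locations:

    // Sketch only: assumes `sc` exists and files are line-delimited JSON.
    // The input glob and the output path are placeholders.
    val tagged = sc.wholeTextFiles("hdfs:///data/*/*/test.data")
      .flatMap { case (path, content) =>
        content.split("\n").filter(_.trim.nonEmpty).map { line =>
          // Naive string splice; a real JSON parser would be safer here.
          line.trim.stripSuffix("}") + s""", "source": "$path"}"""
        }
      }

    // Write all tagged lines to a single target HDFS location.
    tagged.saveAsTextFile("hdfs:///data/tagged_output")

The point is that wholeTextFiles keeps the path of each file next to its content, which is what lets you inject the "source" field into every line before saving to one place. Once the paths point at real data, this should run as-is in spark-shell.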
On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com> wrote:

> Hi Anchit,
>
> Thanks for the quick answer.
>
> My exact question is: I want to add the HDFS location into each line in my
> JSON data.
>
>
> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com>
> wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function called wholeTextFiles within sparkContext
> which can help you with that:
>
> Python
>
> For example, if you have the following files:
>
> hdfs://a-hdfs-path/part-00000
> hdfs://a-hdfs-path/part-00001
> ...
> hdfs://a-hdfs-path/part-nnnnn
>
> then rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path") contains:
>
> (a-hdfs-path/part-00000, its content)
> (a-hdfs-path/part-00001, its content)
> ...
> (a-hdfs-path/part-nnnnn, its content)
>
> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>
> ------------
>
> Scala
>
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>
> More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>
> Let us know if this helps or you need more help.
>
> Thanks,
> Anchit Choudhry
>
> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com>
> wrote:
>
>> Hi,
>>
>> I have multiple files in JSON format, such as:
>>
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>>
>> I can read them with sc.textFile("/data/*/*"),
>>
>> but I want to add {"source": "HDFS_LOCATION"} to each line and then
>> save it to one target HDFS location.
>>
>> How do I do that? Thanks.