Yes. For example, I have two data sets:

data set A: /data/test1/dt=20100101
data set B: /data/test2/dt=20100202

All the data has the same JSON format, such as:

{"key1" : "value1", "key2" : "value2"}

My expected output:

{"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
{"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}
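
For example (just a rough sketch, assuming a SparkContext named sc; the path
regex and the target location /data/merged are only placeholders):

    import json
    import re

    def tag_lines(path_and_content):
        path, content = path_and_content
        # e.g. hdfs://.../data/test1/dt=20100101/part-00000
        m = re.search(r'/data/([^/]+)/dt=(\d+)', path)
        source, date = m.group(1), m.group(2)
        for line in content.splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            record['source'] = source
            record['date'] = date
            yield json.dumps(record)

    # (file path, file content) pairs for both data sets
    rdd = sc.wholeTextFiles('/data/*/dt=*')
    rdd.flatMap(tag_lines).saveAsTextFile('/data/merged')
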
> On Sep 25, 2015, at 11:52, Anchit Choudhry <[email protected]> wrote:
>
> Sure. May I ask for a sample input (could be just a few lines) and the output
> you are expecting, to bring clarity to my thoughts?
>
> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <[email protected]> wrote:
> Hi Anchit,
>
> Thanks for the quick answer.
>
> My exact question is: I want to add the HDFS location to each line of my JSON
> data.
>
>
>
>> On Sep 25, 2015, at 11:25, Anchit Choudhry <[email protected]> wrote:
>>
>> Hi Fengdong,
>>
>> Thanks for your question.
>>
>> Spark already has a function called wholeTextFiles within sparkContext which
>> can help you with that:
>>
>> Python
>>
>> If you have files like these:
>>
>> hdfs://a-hdfs-path/part-00000
>> hdfs://a-hdfs-path/part-00001
>> ...
>> hdfs://a-hdfs-path/part-nnnnn
>>
>> then
>>
>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> gives you an RDD of (file path, file content) pairs:
>>
>> (a-hdfs-path/part-00000, its content)
>> (a-hdfs-path/part-00001, its content)
>> ...
>> (a-hdfs-path/part-nnnnn, its content)
>>
>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
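>>
>> Each (path, content) pair can then be mapped to tag the lines with their
>> file path, for example (a rough sketch, assuming one JSON object per line
>> and a placeholder target path):
>>
>> import json
>>
>> def add_source(path, content):
>>     for line in content.splitlines():
>>         record = json.loads(line)
>>         record['source'] = path
>>         yield json.dumps(record)
>>
>> tagged = rdd.flatMap(lambda pair: add_source(*pair))
>> tagged.saveAsTextFile("hdfs://a-target-path")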
>>
>> ------------
>>
>> Scala
>>
>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>
>> More info:
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>
>> Let us know if this helps or if you need more help.
>>
>> Thanks,
>> Anchit Choudhry
>>
>> On 24 September 2015 at 23:12, Fengdong Yu <[email protected]> wrote:
>> Hi,
>>
>> I have multiple files in JSON format, such as:
>>
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>>
>>
>> I can read them all with sc.textFile("/data/*/*"),
>>
>> but I want to add {"source" : "HDFS_LOCATION"} to each line, then save the
>> result to one target HDFS location.
>>
>> How can I do this? Thanks.
>>
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>