Sure. May I ask for a sample input (it could be just a few lines) and the output you are expecting? That would bring clarity to my thoughts.
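In the meantime, here is a rough sketch of what I have in mind. It assumes `sc` is an existing SparkContext (e.g. in spark-shell), each file contains one JSON object per line, and the input glob and output path below are placeholders, not your real locations:

    // Sketch only: assumes `sc` exists and files are line-delimited JSON.
    // The input glob and the output path are placeholders.
    val tagged = sc.wholeTextFiles("hdfs:///data/*/*/test.data")
      .flatMap { case (path, content) =>
        content.split("\n").filter(_.trim.nonEmpty).map { line =>
          // Naive string splice; a real JSON parser would be safer here.
          line.trim.stripSuffix("}") + s""", "source": "$path"}"""
        }
      }

    // Write all tagged lines to a single target HDFS location.
    tagged.saveAsTextFile("hdfs:///data/tagged_output")

The point is that wholeTextFiles keeps the path of each file next to its content, which is what lets you inject the "source" field into every line before saving to one place. Once the paths point at real data, this should run as-is in spark-shell.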
On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com> wrote:

> Hi Anchit,
>
> Thanks for the quick answer.
>
> My exact question is: I want to add the HDFS location into each line in my
> JSON data.
>
>
> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com>
> wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function called wholeTextFiles within sparkContext
> which can help you with that:
>
> Python
>
> For example, if you have the following files:
>
> hdfs://a-hdfs-path/part-00000
> hdfs://a-hdfs-path/part-00001
> ...
> hdfs://a-hdfs-path/part-nnnnn
>
> then rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path") contains:
>
> (a-hdfs-path/part-00000, its content)
> (a-hdfs-path/part-00001, its content)
> ...
> (a-hdfs-path/part-nnnnn, its content)
>
> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>
> ------------
>
> Scala
>
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>
> More info: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>
> Let us know if this helps or you need more help.
>
> Thanks,
> Anchit Choudhry
>
> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com>
> wrote:
>
>> Hi,
>>
>> I have multiple files in JSON format, such as:
>>
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>>
>> I can read them with sc.textFile("/data/*/*"),
>>
>> but I want to add {"source": "HDFS_LOCATION"} to each line and then
>> save it to one target HDFS location.
>>
>> How do I do that? Thanks.