sivaram created MAPREDUCE-5820:
----------------------------------
Summary: Unable to process mongodb gridfs collection data in
Hadoop Mapreduce
Key: MAPREDUCE-5820
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5820
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: task
Affects Versions: 2.2.0
Environment: Hadoop, Mongodb
Reporter: sivaram
I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using Java Spark MapReduce. Previously I have
successfully processed MongoDB collections with Hadoop MapReduce using the
Mongo-Hadoop connector, but now I'm unable to handle the binary data coming
from the input GridFS collections.
MongoConfigUtil.setInputURI(config,
        "mongodb://localhost:27017/pdfbooks.fs.chunks");
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);

JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class,
        BSONObject.class);

JavaRDD<String> words = mongoRDD.flatMap(
        new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
            @Override
            public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
                System.out.println(arg._2.toString());
                ...
In the above code I'm accessing the fs.chunks collection as input to my mapper,
so the mapper receives each document as a BSONObject. The problem is that the
input BSONObject data is in an unreadable binary format. For example, the
"System.out.println(arg._2.toString());" statement in the above program gives
the following result:
{ "_id" : { "$oid" : "533e53048f0c8bcb0b3a7ff7"} , "files_id" : { "$oid" :
"533e5303fac7a2e2c4afea08"} , "n" : 0 , "data" : <Binary Data>}
How do I print/access that data in a readable format? Can I use the GridFS API
to do that? If so, please suggest how to convert the input BSONObject to a
GridFS object, or any other better way to do this. Thank you in advance!
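One direction I'm considering (unverified, so please treat it as an assumption
rather than a working solution) is to read the raw bytes of each chunk's "data"
field directly in the mapper. A minimal sketch of a hypothetical helper,
assuming the connector hands the BSON binary field over as either byte[] or
org.bson.types.Binary:

import org.bson.BSONObject;
import org.bson.types.Binary;

public class ChunkData {
    // Hypothetical helper: pull the raw bytes out of one fs.chunks document.
    // Assumes the "data" field arrives as byte[] or org.bson.types.Binary.
    public static byte[] chunkBytes(BSONObject chunk) {
        Object data = chunk.get("data");
        if (data instanceof Binary) {
            return ((Binary) data).getData();
        }
        if (data instanceof byte[]) {
            return (byte[]) data;
        }
        throw new IllegalArgumentException(
                "Unexpected type for 'data' field: " + data.getClass());
    }
}

Even with the bytes in hand, a single chunk of a PDF would not be readable
text; I assume the chunks would still have to be grouped by files_id, ordered
by n, and reassembled before a PDF library could parse them, which is why I'm
asking whether the GridFS API is the right way to handle this.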
--
This message was sent by Atlassian JIRA
(v6.2#6252)