[
https://issues.apache.org/jira/browse/MAPREDUCE-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
sivaram updated MAPREDUCE-5820:
-------------------------------
Priority: Critical (was: Major)
> Unable to process mongodb gridfs collection data in Hadoop Mapreduce
> --------------------------------------------------------------------
>
> Key: MAPREDUCE-5820
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5820
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: task
> Affects Versions: 2.2.0
> Environment: Hadoop, Mongodb
> Reporter: sivaram
> Priority: Critical
>
> I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that
> GridFS collection data using Java Spark MapReduce. Previously I
> successfully processed MongoDB collections with Hadoop MapReduce using the
> Mongo-Hadoop connector, but now I am unable to handle the binary data coming
> from the input GridFS collections.
> MongoConfigUtil.setInputURI(config,
>     "mongodb://localhost:27017/pdfbooks.fs.chunks");
> MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
> JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>     com.mongodb.hadoop.MongoInputFormat.class, Object.class,
>     BSONObject.class);
> JavaRDD<String> words = mongoRDD.flatMap(
>     new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>         @Override
>         public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
>             System.out.println(arg._2.toString());
>             ...
> In the above code I am accessing the fs.chunks collection as input to my
> mapper, so the mapper receives each document as a BSONObject. The problem is
> that the input BSONObject data is in an unreadable binary format. For example,
> the "System.out.println(arg._2.toString());" statement above prints:
> { "_id" : { "$oid" : "533e53048f0c8bcb0b3a7ff7"} , "files_id" : { "$oid" :
> "533e5303fac7a2e2c4afea08"} , "n" : 0 , "data" : <Binary Data>}
> How do I print or access that data in a readable format? Can I use the GridFS
> API to do that? If so, please suggest how to convert the input BSONObject to a
> GridFS object, or any better way to do this. Thank you in advance!
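
One possible way to get at the chunk bytes, sketched under stated assumptions rather than offered as a confirmed fix: each fs.chunks document carries a sequence number in "n" and a payload in "data", as the printed output above shows. The payload, read with arg._2.get("data"), is assumed here to arrive as a byte[]; depending on the driver version it may instead be wrapped in org.bson.types.Binary, whose getData() returns the bytes. The names GridFsChunkSketch, Chunk, toBytes and reassemble are hypothetical, and the sketch uses only the JDK so it can be tried without a MongoDB instance.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: rebuilds a file's bytes from its GridFS chunks.
class GridFsChunkSketch {

    // Hypothetical holder for the "n" and "data" fields of one fs.chunks document.
    static final class Chunk {
        final int n;        // chunk sequence number ("n")
        final byte[] data;  // chunk payload ("data")

        Chunk(int n, byte[] data) {
            this.n = n;
            this.data = data;
        }
    }

    // Converts the raw "data" value to bytes. Assumption: the value is a byte[].
    // If it is an org.bson.types.Binary instead, call getData() on it before
    // passing it in.
    static byte[] toBytes(Object raw) {
        if (raw instanceof byte[]) {
            return (byte[]) raw;
        }
        throw new IllegalArgumentException(
            "Unexpected type for chunk data: "
                + (raw == null ? "null" : raw.getClass().getName()));
    }

    // Concatenates the chunks of ONE file in ascending "n" order.
    static byte[] reassemble(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.n));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Chunk c : sorted) {
            out.write(c.data, 0, c.data.length);
        }
        return out.toByteArray();
    }
}
```

In a Spark job the chunks would first need to be grouped by "files_id" so that reassemble only ever sees the chunks of a single file. Two caveats: the bytes of a PDF are not plain text, so a PDF parser is still needed to extract words, and a 2 GB file exceeds what a single Java byte[] can hold. For a file of that size, streaming it through the Java driver's GridFS class (GridFSDBFile.getInputStream()) is probably a better fit than reassembling it in memory.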
--
This message was sent by Atlassian JIRA
(v6.2#6252)