sivaram created MAPREDUCE-5820:
----------------------------------
Summary: Unable to process mongodb gridfs collection data in
Hadoop Mapreduce
Key: MAPREDUCE-5820
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5820
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: task
Affects Versions: 2.2.0
Environment: Hadoop, Mongodb
Reporter: sivaram
I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that
GridFS collection data using Java Spark MapReduce. Previously I have
successfully processed MongoDB collections with Hadoop MapReduce using the
Mongo-Hadoop connector, but now I'm unable to handle the binary data coming
from the input GridFS collections.
MongoConfigUtil.setInputURI(config,
        "mongodb://localhost:27017/pdfbooks.fs.chunks");
MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);

JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
        com.mongodb.hadoop.MongoInputFormat.class, Object.class,
        BSONObject.class);

JavaRDD<String> words = mongoRDD.flatMap(
        new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
            @Override
            public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
                System.out.println(arg._2.toString());
                ...
In the above code I'm accessing the fs.chunks collection as input to my mapper,
so the mapper receives each document as a BSONObject. The problem is that the
input BSONObject data is in an unreadable binary format. For example, the
"System.out.println(arg._2.toString());" statement in the above program gives
the following result:
{ "_id" : { "$oid" : "533e53048f0c8bcb0b3a7ff7"} , "files_id" : { "$oid" :
"533e5303fac7a2e2c4afea08"} , "n" : 0 , "data" : <Binary Data>}
How do I print/access that data in a readable format? Can I use the GridFS API
to do that? If so, please suggest how to convert the input BSONObject to a
GridFS object, or any other better way to do this. Thank you in advance!
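One direction I'm considering (unverified, so please treat it as an assumption
rather than a working solution) is to read the raw bytes of each chunk's "data"
field directly in the mapper. A minimal sketch of a hypothetical helper,
assuming the connector hands the BSON binary field over as either byte[] or
org.bson.types.Binary:

import org.bson.BSONObject;
import org.bson.types.Binary;

public class ChunkData {
    // Hypothetical helper: pull the raw bytes out of one fs.chunks document.
    // Assumes the "data" field arrives as byte[] or org.bson.types.Binary.
    public static byte[] chunkBytes(BSONObject chunk) {
        Object data = chunk.get("data");
        if (data instanceof Binary) {
            return ((Binary) data).getData();
        }
        if (data instanceof byte[]) {
            return (byte[]) data;
        }
        throw new IllegalArgumentException(
                "Unexpected type for 'data' field: " + data.getClass());
    }
}

Even with the bytes in hand, a single chunk of a PDF would not be readable
text; I assume the chunks would still have to be grouped by files_id, ordered
by n, and reassembled before a PDF library could parse them, which is why I'm
asking whether the GridFS API is the right way to handle this.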
--
This message was sent by Atlassian JIRA
(v6.2#6252)