[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivaram updated MAPREDUCE-5820:
-------------------------------

    Priority: Critical  (was: Major)

> Unable to process mongodb gridfs collection data in Hadoop Mapreduce
> --------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5820
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5820
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: task
>    Affects Versions: 2.2.0
>         Environment: Hadoop, Mongodb
>            Reporter: sivaram
>            Priority: Critical
>
> I saved a 2 GB PDF file into MongoDB using GridFS. Now I want to process that 
> GridFS collection data using Java Spark MapReduce. Previously I 
> successfully processed MongoDB collections with Hadoop MapReduce using the 
> Mongo-Hadoop connector, but I am unable to handle the binary data coming 
> from the input GridFS collections.
>     MongoConfigUtil.setInputURI(config,
>             "mongodb://localhost:27017/pdfbooks.fs.chunks");
>     MongoConfigUtil.setOutputURI(config, "mongodb://localhost:27017/" + output);
>     JavaPairRDD<Object, BSONObject> mongoRDD = sc.newAPIHadoopRDD(config,
>             com.mongodb.hadoop.MongoInputFormat.class, Object.class,
>             BSONObject.class);
>     JavaRDD<String> words = mongoRDD.flatMap(
>             new FlatMapFunction<Tuple2<Object, BSONObject>, String>() {
>         @Override
>         public Iterable<String> call(Tuple2<Object, BSONObject> arg) {
>             System.out.println(arg._2.toString());
>             ...
> In the above code I am accessing the fs.chunks collection as input to my 
> mapper, so the mapper receives each document as a BSONObject. The problem is 
> that the data in the input BSONObject is in an unreadable binary format. For 
> example, the statement "System.out.println(arg._2.toString());" in the above 
> program gives the following result:
>    { "_id" : { "$oid" : "533e53048f0c8bcb0b3a7ff7"} , "files_id" : { "$oid" : 
> "533e5303fac7a2e2c4afea08"} , "n" : 0 , "data" : <Binary Data>}
> How do I print or access that data in a readable format? Can I use the GridFS 
> API to do that? If so, please suggest how to convert the input BSONObject to 
> a GridFS object, or any better way to do this. Thank you in advance!
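
One possible way to get at the bytes, sketched under assumptions: each
fs.chunks document carries a chunk index "n" and a "data" field, and the
chunks of one file (same "files_id") can be concatenated in "n" order to
recover the original bytes. In the sketch below a plain Map stands in for
BSONObject.toMap() so that it runs without the MongoDB driver; the class and
method names are hypothetical. It assumes "data" arrives as a byte[].
Depending on the driver version it may instead be an org.bson.types.Binary,
in which case getData() would supply the byte[].

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Mongo-Hadoop connector or the driver.
public class GridFsChunkSketch {

    // Pull the raw bytes out of one chunk document.
    // Assumption: "data" is a byte[]; other types are rejected explicitly.
    public static byte[] chunkBytes(Map<String, Object> chunk) {
        Object data = chunk.get("data");
        if (data instanceof byte[]) {
            return (byte[]) data;
        }
        throw new IllegalArgumentException("unexpected type for 'data': "
                + (data == null ? "null" : data.getClass().getName()));
    }

    // Concatenate the chunks of ONE file in ascending "n" order.
    // The caller is expected to have grouped the chunks by "files_id" first.
    public static byte[] reassemble(List<Map<String, Object>> chunks) {
        List<Map<String, Object>> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt(
                (Map<String, Object> c) -> ((Number) c.get("n")).intValue()));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (Map<String, Object> c : sorted) {
            byte[] bytes = chunkBytes(c);
            out.write(bytes, 0, bytes.length);
        }
        return out.toByteArray();
    }

    // Build a stand-in chunk document for the demo below.
    public static Map<String, Object> chunk(int n, String payload) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("n", n);
        doc.put("data", payload.getBytes(StandardCharsets.UTF_8));
        return doc;
    }

    public static void main(String[] args) {
        // Chunks deliberately arrive out of order, as they may in a mapper.
        List<Map<String, Object>> chunks = new ArrayList<>();
        chunks.add(chunk(2, "sample"));
        chunks.add(chunk(0, "%PDF-"));
        chunks.add(chunk(1, "1.4 "));
        byte[] file = reassemble(chunks);
        // prints "%PDF-1.4 sample"
        System.out.println(new String(file, StandardCharsets.UTF_8));
    }
}
```

Two caveats on this sketch. A 2 GB file will not fit comfortably in a single
in-memory byte array, so a real job would more likely stream each chunk to an
OutputStream than collect everything as shown here. Also, the bytes of a PDF
are not plain text even after reassembly, so printing them will still look
unreadable; extracting text would need a PDF parser applied to the
reassembled file.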



--
This message was sent by Atlassian JIRA
(v6.2#6252)
