You're not seeing the issue because you perform one additional "map": map { case (k, v) => (k.get(), v.toString) }. Instead of being able to use the Text object that was read directly, you had to create a new tuple out of the Text's string contents.
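To make the problem concrete, here is a minimal sketch (assuming a SequenceFile of IntWritable keys and Text values; the path and variable names are placeholders):

import org.apache.hadoop.io.{IntWritable, Text}

// assumes an existing SparkContext named sc
val rdd = sc.sequenceFile("/path/to/data.seq", classOf[IntWritable], classOf[Text])

// Unsafe: the RecordReader reuses a single IntWritable and a single Text
// instance, so caching these references makes every cached record point
// at the last key/value that was read.
// rdd.cache()

// Safe: copy the contents out of the reused Writables before caching,
// sorting, or aggregating.
val copied = rdd.map { case (k, v) => (k.get(), v.toString) }
copied.cache()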
That is exactly why I asked this question. Why do we have to do this additional processing? What is the rationale behind it? Are there other ways of reading a Hadoop file (or any other file) that would not incur this additional step?

thanks

Date: Thu, 19 Nov 2015 13:26:31 +0800
Subject: Re: FW: SequenceFile and object reuse
From: zjf...@gmail.com
To: jeffsar...@hotmail.com
CC: dev@spark.apache.org

Could this be an issue with the raw data? I use the following simple code and don't hit the issue you mentioned. Otherwise it would be better to share your code.

val rdd = sc.sequenceFile("/Users/hadoop/Temp/Seq", classOf[IntWritable], classOf[Text])
rdd.map { case (k, v) => (k.get(), v.toString) }.collect().foreach(println)

On Thu, Nov 19, 2015 at 12:04 PM, jeff saremi <jeffsar...@hotmail.com> wrote:

I sent this to the user forum. I got no responses. Could someone here please help? thanks

jeff

From: jeffsar...@hotmail.com
To: u...@spark.apache.org
Subject: SequenceFile and object reuse
Date: Fri, 13 Nov 2015 13:29:58 -0500

So we tried reading a SequenceFile in Spark and realized that all our records had ended up becoming the same. Then one of us found this:

Note: Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to directly cache, sort, or aggregate Hadoop writable objects, you should first copy them using a map function.

Is there anyone who can shed some light on this bizarre behavior and the decisions behind it? And I would also like to know if anyone has been able to read a binary file without incurring the additional map() as suggested by the above? What format did you use?

thanks
Jeff

--
Best Regards

Jeff Zhang
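If the Writable objects themselves need to be kept downstream (rather than converting them to Int/String), one option is to copy each record with Hadoop's WritableUtils.clone. This is only a sketch (the path and names are placeholders), and it still amounts to the same extra pass over the data:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{IntWritable, Text, WritableUtils}

// assumes an existing SparkContext named sc
val rdd = sc.sequenceFile("/path/to/data.seq", classOf[IntWritable], classOf[Text])

// Clone the reused Writables so the cached RDD holds distinct objects.
// The Configuration is created inside mapPartitions because it is not
// serializable and should not be captured in the closure.
val cloned = rdd.mapPartitions { iter =>
  val conf = new Configuration()
  iter.map { case (k, v) =>
    (WritableUtils.clone(k, conf), WritableUtils.clone(v, conf))
  }
}
cloned.cache()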