Sandy, Ryan, Andrew
Thanks very much. I think I now understand it better.
Jeff

From: ryan.blake.willi...@gmail.com
Date: Thu, 19 Nov 2015 06:00:30 +0000
Subject: Re: SequenceFile and object reuse
To: sandy.r...@cloudera.com; jeffsar...@hotmail.com
CC: user@spark.apache.org

Hey Jeff, in addition to what Sandy said, there are two more reasons that this 
might not be as bad as it seems; I may be incorrect in my understanding though.
First, the "additional step" you're referring to is not likely to be adding any 
overhead; the "extra map" is really just materializing the data once (as 
opposed to zero times), which is what you want (assuming your access pattern 
couldn't be reformulated in the way Sandy described, i.e. where all the objects 
in a partition don't need to be in memory at the same time).
Secondly, even if this was an "extra" map step, it wouldn't add any extra 
stages to a given pipeline, being a "narrow" dependency, so it would likely be 
low-cost anyway.
Let me know if any of the above seems incorrect, thanks!
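[Editor's note: to make the "narrow dependency" point concrete, here is a minimal sketch; the input path and the Text/IntWritable record types are assumptions, not from the thread. The copy map is pipelined into the same stage as the file read, which the RDD lineage makes visible.]

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object NarrowCopyLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("narrow-copy-lineage").setMaster("local[*]"))

    // Hypothetical input path; keys/values assumed to be Text/IntWritable.
    val copied = sc
      .sequenceFile("/path/to/input", classOf[Text], classOf[IntWritable])
      .map { case (k, v) => (new Text(k), new IntWritable(v.get)) }

    // map() is a narrow dependency: the lineage shows a MapPartitionsRDD
    // layered directly on the Hadoop input RDD with no shuffle, so the copy
    // runs in the same stage as the read rather than adding a new one.
    println(copied.toDebugString)

    sc.stop()
  }
}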
On Thu, Nov 19, 2015 at 12:41 AM Sandy Ryza <sandy.r...@cloudera.com> wrote:
Hi Jeff,
Many access patterns simply take the result of hadoopFile and use it to create 
some other object, and thus have no need for each input record to refer to a 
different object.  In those cases, the current API is more performant than an 
alternative that would create an object for each record, because it avoids the 
unnecessary overhead of creating Java objects.  As you've pointed out, this is 
at the expense of making the code more verbose when caching.
-Sandy
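[Editor's note: a minimal sketch of the kind of access pattern Sandy describes; the path and record types are assumptions. Each record is converted to plain Scala values immediately, so the reused Writable instances never need to outlive a single iteration.]

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object ConvertOnRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("convert-on-read").setMaster("local[*]"))

    // Each reused (Text, IntWritable) pair is turned into an immutable
    // (String, Int) right away, so no reference to the shared Writable
    // objects survives into the cache or a shuffle.
    val converted = sc
      .sequenceFile("/path/to/input", classOf[Text], classOf[IntWritable])
      .map { case (k, v) => (k.toString, v.get) }
      .cache()

    println(converted.count())

    sc.stop()
  }
}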
On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi <jeffsar...@hotmail.com> wrote:



So we tried reading a SequenceFile in Spark and realized that all our records 
had ended up being the same.
Then one of us found this:

Note: Because Hadoop's RecordReader class re-uses the same Writable object for 
each record, directly caching the returned RDD or directly passing it to an 
aggregation or shuffle operation will create many references to the same 
object. If you plan to directly cache, sort, or aggregate Hadoop writable 
objects, you should first copy them using a map function.
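[Editor's note: a minimal sketch of the copy that note recommends; Text keys, IntWritable values, and the input path are assumptions. The Writable copy constructors give every cached record its own object.]

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

object CopyBeforeCache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("copy-before-cache").setMaster("local[*]"))

    // The RecordReader reuses one Text/IntWritable pair for every record,
    // so caching `raw` directly would store many references to the same
    // two objects.
    val raw = sc.sequenceFile("/path/to/input", classOf[Text], classOf[IntWritable])

    // Copy each record before caching so every cached element owns its data.
    val copied = raw.map { case (k, v) => (new Text(k), new IntWritable(v.get)) }.cache()

    println(copied.count())

    sc.stop()
  }
}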

Is there anyone who can shed some light on this bizarre behavior and the 
decisions behind it?
I would also like to know whether anyone has been able to read a binary file 
without incurring the additional map() suggested above. What format did you use?

thanks,
Jeff


                                          
