On Mon, Feb 24, 2014 at 11:25 AM, Jinal Shah <[email protected]> wrote:

> Thanks Josh, I have a few follow-up questions.
>
> So let's say with the default scaleFactor, how much approximation should
> we assume, like +/- 1%?

In the worst case, it can be arbitrarily wrong (although I suppose we're
bounded on the low end by zero). The primary sources of error are a) the
fact that the serialized size on disk is less than (and sometimes
significantly less than) the in-memory size once Java's object overhead is
included, and b) scaleFactor may or may not accurately reflect the
operations performed by the DoFn. If I was a conservative man, and in this
I am, I would assume that the in-memory storage size of the data will be 2x
whatever the scaleFactor-based estimate from getSize() reports, at least
for purposes of deciding between an in-memory vs. a reduce-side join.

> How does scaleFactor affect the size of the object?

It doesn't affect it; it only reports what the developer thinks the DoFn
will do to the size of any input it receives. Sometimes this is relatively
easy to determine, like if we have a FilterFn that is going to filter out
half of its inputs. For an arbitrary DoFn, it's harder to do precisely.

> Can this be a part of Crunch as an enhancement to the current Join
> strategy?

We have generally stayed away from any sort of intelligent join strategy
selection, although it's come up a couple of times during discussions on
the mailing list. One of our principles is to avoid magic wherever possible
and always give developers precise control over the operations performed
during a pipeline, so I would want to be careful about how we proceeded
with this sort of thing.

> Thanks
> Jinal
>
>
> On Mon, Feb 24, 2014 at 1:01 PM, Josh Wills <[email protected]> wrote:
>
> > Ah, cool. The long getSize() method will return Crunch's estimate of
> > the size of the object in bytes, but it's good to keep in mind that
> > it's a very rough approximation based on the size of the file on disk
> > and any info we have about the behavior of any DoFns that are applied
> > to the PTable when it is processed, which is communicated via the
> > scaleFactor() function on each DoFn.
> >
> >
> > On Mon, Feb 24, 2014 at 10:57 AM, Jinal Shah <[email protected]>
> > wrote:
> >
> > > By size I meant the memory size, sorry for the confusion. Like how
> > > much memory will a PTable object require. Basically, what I'm trying
> > > to do is: if the object is not that large and it could fit in memory,
> > > I want to apply a map-side join to optimize the join, and depending
> > > on that I also want to determine which one is smaller to use in the
> > > left join.
> > >
> > >
> > > On Mon, Feb 24, 2014 at 12:45 PM, Josh Wills <[email protected]>
> > > wrote:
> > >
> > > > There is the length() method, which will return a PObject<Long>
> > > > with the number of elements in the PCollection. It requires
> > > > running an MR job though.
> > > >
> > > > J
> > > >
> > > >
> > > > On Mon, Feb 24, 2014 at 10:03 AM, Jinal Shah
> > > > <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Is there a way in Crunch to find the size of a particular
> > > > > PCollection or PTable as a whole?
> > > > >
> > > > > Thanks
> > > > > Jinal

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
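
For anyone reading this thread later, here is a rough sketch of the rule of thumb discussed above: take a scaleFactor-style size estimate, assume the in-memory footprint is 2x that, and only then compare it against a memory budget. This is plain Java and does not call Crunch. The class JoinSizing, its methods, the JoinKind enum, and the memoryBudgetBytes parameter are all hypothetical names made up for illustration, and the way the estimate is multiplied through is only an approximation of what a size estimate based on scaleFactor() might look like.

```java
// Hypothetical helper, not part of Crunch: it only mirrors the rule of
// thumb from this thread in plain Java.
final class JoinSizing {

    // Assumption taken from the thread: treat the in-memory size as 2x
    // whatever the size estimate reports.
    static final double MEMORY_SAFETY_FACTOR = 2.0;

    enum JoinKind { MAP_SIDE, REDUCE_SIDE }

    // Approximates a size estimate: start from the bytes on disk and
    // multiply by the scale factor of each DoFn applied along the way.
    static long estimateBytes(long bytesOnDisk, float... scaleFactors) {
        double estimate = bytesOnDisk;
        for (float factor : scaleFactors) {
            estimate *= factor;
        }
        return (long) estimate;
    }

    // Picks a map-side join only if the padded estimate fits in the
    // caller-supplied memory budget; otherwise falls back to reduce-side.
    static JoinKind chooseJoin(long estimatedBytes, long memoryBudgetBytes) {
        double assumedInMemory = estimatedBytes * MEMORY_SAFETY_FACTOR;
        return assumedInMemory <= memoryBudgetBytes
                ? JoinKind.MAP_SIDE
                : JoinKind.REDUCE_SIDE;
    }
}
```

In a real pipeline the first argument to chooseJoin would presumably come from getSize() on the PTable, and the memory budget is a number you would have to pick yourself. Since the estimate can be arbitrarily wrong, this is a guard against obviously oversized tables and not a guarantee that the data fits.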
