Mailing lists don't support attachments. Is JIRA a place we can discuss this? Based on the outcome, we could classify it as either an improvement/bug or "Not a Problem".
-Prashant

On Thu, Feb 21, 2013 at 7:02 PM, Aniket Mokashi <[email protected]> wrote:

> Thanks Johnny. I am not sure how to post these images on mailing lists! :(
>
> On Thu, Feb 21, 2013 at 6:30 PM, Johnny Zhang <[email protected]> wrote:
>
> > Hi, Aniket:
> > your image is blank :) not sure if this only happens to me though.
> >
> > Johnny
> >
> > On Thu, Feb 21, 2013 at 6:08 PM, Aniket Mokashi <[email protected]> wrote:
> >
> > > I think the email was filtered out. Resending.
> > >
> > > ---------- Forwarded message ----------
> > > From: Aniket Mokashi <[email protected]>
> > > Date: Wed, Feb 20, 2013 at 1:18 PM
> > > Subject: Replicated join: is there a setting to make this better?
> > > To: "[email protected]" <[email protected]>
> > >
> > > Hi devs,
> > >
> > > I was looking into limitations of size/records for fragment
> > > replicated join (map join) in pig. To test that I loaded a map (aka
> > > fragment) of longs in an alias to join it with other alias which had
> > > few other columns. With a map file of 50mb I saw GC Overheads on the
> > > mappers. I took a heap dump of mapper to look into whats causing the
> > > GC Overheads and found that its the memory footprint of fragment
> > > itself was high.
> > >
> > > [image: Inline image 1]
> > >
> > > Note, the hashmap was able to only load about 1.8 million records-
> > > [image: Inline image 2]
> > > Reason was that every map record has an overhead of about 1.5kb.
> > > Most of it is part of retained heap, but it needs to be garbage
> > > collected.
> > > [image: Inline image 3]
> > >
> > > So, it turns out-
> > >
> > > Size of heap required by a map join from above = 1.5 KB * Number of
> > > records + Size of input (uncompressed databytearray)... (assuming
> > > the key is a long).
> > >
> > > So, to run your replicated join, you need to satisfy following
> > > criteria:
> > >
> > > *1.5 KB * Number of records + Size of input (uncompressed) <
> > > estimated free memory in the mapper (total heap - io.sort.mb - some
> > > minor constant etc.)*
> > >
> > > Is that a right conclusion? Is there a setting/way to make this
> > > better?
> > >
> > > Thanks,
> > >
> > > Aniket
> > >
> > > --
> > > "...:::Aniket:::... Quetzalco@tl"
>
> --
> "...:::Aniket:::... Quetzalco@tl"
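
For reference, the criterion Aniket proposes in the quoted message can be written down as a small calculation. This is only a sketch of the arithmetic from the thread, not anything Pig itself does: the function names are made up for illustration, the 1.5 KB per-record overhead is the figure observed in his heap dump (for long keys), 1 KB is assumed to mean 1024 bytes, and the "minor constant" is left as a parameter defaulting to zero because the thread gives no value for it.

```python
KB = 1024          # assumption: the thread's "KB" means 1024 bytes
MB = 1024 * KB

# Per-record overhead observed in the heap dump described in the thread.
PER_RECORD_OVERHEAD_BYTES = int(1.5 * KB)


def required_heap_bytes(num_records, input_bytes,
                        per_record_overhead=PER_RECORD_OVERHEAD_BYTES):
    """Heap needed by the replicated side: overhead * records + raw input size."""
    return per_record_overhead * num_records + input_bytes


def free_mapper_heap_bytes(total_heap_bytes, io_sort_mb, other_overhead_bytes=0):
    """Estimated free mapper memory: total heap - io.sort.mb - a minor constant.

    other_overhead_bytes stands in for the unspecified "minor constant".
    """
    return total_heap_bytes - io_sort_mb * MB - other_overhead_bytes


def replicated_join_fits(num_records, input_bytes, total_heap_bytes, io_sort_mb,
                         other_overhead_bytes=0):
    """True if the proposed criterion (required < free) is satisfied."""
    required = required_heap_bytes(num_records, input_bytes)
    free = free_mapper_heap_bytes(total_heap_bytes, io_sort_mb, other_overhead_bytes)
    return required < free


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the formula.
    print(replicated_join_fits(
        num_records=100_000,
        input_bytes=5 * MB,
        total_heap_bytes=1024 * MB,
        io_sort_mb=100,
    ))
```

The heap size, io.sort.mb value, and record count in the usage example are placeholders; substitute the values from your own job configuration and input.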
