The underlying buffer for UnsafeRow is reused in UnsafeProjection.
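
To make that concrete, here is a rough sketch of why the copy matters. It is not the real UnsafeHashedRelation code (the object and method names and the plain Scala HashMap are made up for illustration), but it shows the hazard: an UnsafeProjection writes every output row into the same reused buffer, so the rows its iterator hands out all alias that buffer.

import scala.collection.mutable

import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Hypothetical, simplified stand-in for building a hash table from projected rows.
// `rows` is an iterator whose UnsafeRows may all share one reused backing buffer,
// as they do when they come straight out of an UnsafeProjection.
object HashedRelationSketch {
  def buildTable(rows: Iterator[UnsafeRow], keyIndex: Int): mutable.HashMap[Long, UnsafeRow] = {
    val table = mutable.HashMap.empty[Long, UnsafeRow]
    while (rows.hasNext) {
      val row = rows.next()
      // Without copy(), every value in the map would alias the same reused buffer
      // and end up holding the bytes of whichever row the projection produced last.
      table.put(row.getLong(keyIndex), row.copy())
    }
    table
  }
}

The real relation stores its rows in a more specialized structure than a Scala HashMap, but the copy() call is there for the same reason: without it the table would point at memory the projection is about to overwrite.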
On Thu, Mar 3, 2016 at 9:11 PM, Rishi Mishra <rmis...@snappydata.io> wrote:
> Hi Davies,
> When you say "UnsafeRow could come from UnsafeProjection, so we should copy
> the rows for safety," do you intend to say that the underlying state might
> change because of some state update APIs? Or is it due to some other rationale?
>
> Regards,
> Rishitesh Mishra,
> SnappyData (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Thu, Mar 3, 2016 at 3:59 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>> I see, we could reduce the memory by moving the copy out of HashedRelation;
>> then we would do the copy before calling HashedRelation for a shuffle hash join.
>>
>> Another thing is that when we do broadcasting, we will have another
>> serialized copy of the hash table.
>>
>> For a table larger than 100M, we would not suggest using a broadcast join,
>> because it takes time to send it to every executor and also takes the same
>> amount of memory on every executor.
>>
>> On Wed, Mar 2, 2016 at 10:45 AM, Matt Cheah <mch...@palantir.com> wrote:
>> > I would expect the memory pressure to grow because not only are we storing
>> > the backing array for the iterator of rows on the driver, but we're also
>> > storing a copy of each of those rows in the hash table. If we didn't do the
>> > copy on the driver side, the hash table would only have to store pointers
>> > to those rows in the array. Perhaps we can think about whether or not we
>> > want to be using the HashedRelation constructs in broadcast join physical
>> > plans?
>> >
>> > The file isn't compressed - it was a 150MB CSV living on HDFS. I would
>> > expect it to fit in a 1GB heap, but I agree that it is difficult to reason
>> > about dataset size on disk vs. in memory.
>> >
>> > -Matt Cheah
>> >
>> > On 3/2/16, 10:15 AM, "Davies Liu" <dav...@databricks.com> wrote:
>> >
>> >> UnsafeHashedRelation and HashedRelation can also be used in the executor
>> >> (for a non-broadcast hash join); there the UnsafeRow could come from
>> >> UnsafeProjection, so we should copy the rows for safety.
>> >>
>> >> We could have a smarter copy() for UnsafeRow (avoid the copy if it's
>> >> already copied), but I don't think the copy here will increase the memory
>> >> pressure. The total memory will be determined by how many rows are stored
>> >> in the hash tables.
>> >>
>> >> In general, if you do not have enough memory, just don't increase
>> >> autoBroadcastJoinThreshold, or the performance could get worse because of
>> >> full GC.
>> >>
>> >> Sometimes a table looks small as a compressed file (for example, a Parquet
>> >> file), but once it's loaded into memory it can require much more memory
>> >> than the size of the file on disk.
>> >>
>> >> On Tue, Mar 1, 2016 at 5:17 PM, Matt Cheah <mch...@palantir.com> wrote:
>> >>> Hi everyone,
>> >>>
>> >>> I had a quick question regarding our implementation of
>> >>> UnsafeHashedRelation and HashedRelation. It appears that we copy the rows
>> >>> we've collected into memory when inserting them into the hash table in
>> >>> UnsafeHashedRelation#apply(). I was wondering why we copy the rows every
>> >>> time? I can't imagine these rows being mutable in this scenario.
>> >>>
>> >>> The context is that I'm looking into a case where a small data frame
>> >>> should fit in the driver's memory, but my driver ran out of memory after
>> >>> I increased the autoBroadcastJoinThreshold. YourKit is indicating that
>> >>> this logic is consuming more memory than my driver can handle.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> -Matt Cheah
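
A side note on the autoBroadcastJoinThreshold advice quoted above: the threshold is a regular SQL conf, so it can be raised, or broadcast hash joins switched off entirely, per application. A minimal, hypothetical example (the context setup and app name are made up; adjust to your own environment):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("broadcast-threshold-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Default is 10485760 (10 MB). Raising it makes more tables eligible for a
// broadcast hash join, which also means collecting those tables to the driver
// and copying their rows into the driver-side hash table.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

// Setting it to -1 disables broadcast hash joins entirely, which is the safer
// choice when the driver heap is tight.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")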