Re: Review Request 30739: HIVE-9574 Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch]

Jimmy Xiang Sun, 08 Feb 2015 10:33:00 -0800


> On Feb. 7, 2015, 6:38 p.m., Xuefu Zhang wrote:
> > High-level comments:
> > 1. File name convention needs to take care of multiple threads doing the 
> > same thing in a single JVM.
> > 2. Avoid creating Tuple2 objects. There is no need to have this object, but 
> > we create them just to wrap key/value, likely increase GC pressure.
> > 3. Disk writing can be more performing. We should be able to put 
> > keys/values as bytes in a large byte[] (as the buffer). The we need to 
> > spill, we write the whole byte[] to disk in one shot.
> > 4. Related to #3, but I'm wondering why we need kryo at all.
> > 5. We need to take care of file closing in case of exceptions. The call 
> > usually goes into a finally block.
> > 6. It's better to allocate the buffer in the beginning and fail right way 
> > in case of OOM. We don't want the job to fail when only the last row needs 
> > to spill and incur OOM then.


1. As in the javadoc, HiveKVResultCache is not thread safe, each 
HiveBaseFunctionResultList instance should have its own cache. Any suggestion 
as how to change the file name?
2. The next() returns a Tuple2. We need to create a Tuple2 before returning the 
pair. Are you saying we puth the pair in buffer without creating a Tuple2?
3. Output has a buffer itself. With the current way, we don't need to create 
another buffer.
4. Kryo is not used here. We just used Input and Output. The reason is that 
they both have buffers inside. Input has a way to tell if we are done with the 
file before trying to read from it.
5. Good point, will fix that.
6. The buffer iteself doesn't need much memory. Yes, we can pre-allocate it. 
Will fix that.


- Jimmy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30739/#review71546
-----------------------------------------------------------


On Feb. 7, 2015, 3:09 a.m., Jimmy Xiang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/30739/
> -----------------------------------------------------------
> 
> (Updated Feb. 7, 2015, 3:09 a.m.)
> 
> 
> Review request for hive, Rui Li and Xuefu Zhang.
> 
> 
> Bugs: HIVE-9574
>     https://issues.apache.org/jira/browse/HIVE-9574
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Result KV cache doesn't use RowContainer any more since it has logic we don't 
> need, which is some overhead. We don't do lazy computing right away, instead 
> we wait a little till the cache is close to spill.
> 
> 
> Diffs
> -----
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java
>  78ab680 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveKVResultCache.java 
> 8ead0cb 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunction.java 
> 7a09b4d 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveMapFunctionResultList.java
>  e92e299 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 
> 070ea4d 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunctionResultList.java
>  d4ff37c 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/spark/KryoSerializer.java 
> 286816b 
>   ql/src/test/org/apache/hadoop/hive/ql/exec/spark/TestHiveKVResultCache.java 
> 0df4598 
> 
> Diff: https://reviews.apache.org/r/30739/diff/
> 
> 
> Testing
> -------
> 
> Unit test, test on cluster
> 
> 
> Thanks,
> 
> Jimmy Xiang
> 
>

Re: Review Request 30739: HIVE-9574 Lazy computing in HiveBaseFunctionResultList may hurt performance [Spark Branch]

Reply via email to