Re: Review Request: Vectorized Timestamp functions for long nanosecond based timestamps

Gopal V Fri, 31 May 2013 06:58:49 -0700


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java,
> >  line 56
> > <https://reviews.apache.org/r/11530/diff/1/?file=298364#file298364line56>
> >
> >     If you know in what cases ms can be negative, can you add that to the 
> > comment? That seems unusual. 
> >     
> >     In think this because if the time is negative (before the epoch) you 
> > could get negative nanos. So you want to convert before creating the 
> > timestamp. Is that right? Please elaborate a little in the comment.


The issue has to do with the way sql.Timestamp stores values.

http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/sql/Timestamp.java#126


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java,
> >  line 62
> > <https://reviews.apache.org/r/11530/diff/1/?file=298364#file298364line62>
> >
> >     Would it be possible to re-use a timestamp that belongs to the class 
> > here, rather than calling new? If so, please do that to speed this up. I 
> > think you can do what you need with setTime() and setNanos().
> >     
> >     Eliminating new() in the inner loop of vector processing tends to speed 
> > things up a lot.

Will remove the extra timestamp created here.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java,
> >  line 51
> > <https://reviews.apache.org/r/11530/diff/1/?file=298364#file298364line51>
> >
> >     Can you confirm this expression will be constant-folded by the 
> > compiler? Otherwise this should be evaluated by hand in advance.

in bytecode it turns into the appropriate 2 integer division ops.
   6:   lload_1
   7:   ldc2_w  #3; //long 1000000000l
   10:  ldiv
   11:  ldc2_w  #5; //long 1000l
   14:  ldiv


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFWeekOfYearLong.java,
> >  line 29
> > <https://reviews.apache.org/r/11530/diff/1/?file=298366#file298366line29>
> >
> >     Please explain why constant 4 is correct here and what it means.

This copied over from UDFWeekOfYear, which also leaves it undocumented.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java,
> >  line 33
> > <https://reviews.apache.org/r/11530/diff/1/?file=298367#file298367line33>
> >
> >     can you use minYear and maxYear here instead of literals?

will do.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java,
> >  line 43
> > <https://reviews.apache.org/r/11530/diff/1/?file=298367#file298367line43>
> >
> >     Given that this function is moderately heavy anyway (with the binary 
> > search) I think making it virtual will not slow things down much. 
> >     
> >     But if it gets faster we should seriously consider creating a separate 
> > evaluate method for VectorUDFYearLong and make this a static function to 
> > avoid virtual method call overhead.
> >

Converting all leaf UDFs to be final classes, this triggers the virtual method 
cache and also makes the parent methods (from an abstract class) valid for 
inlining.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java,
> >  line 45
> > <https://reviews.apache.org/r/11530/diff/1/?file=298367#file298367line45>
> >
> >     Did you consider calculating approximate year using something like
> >     
> >     approxYear = yearBase + (int)(time / nanosPerYear) - 1; 
> >     
> >     and then linearly search forward to find year boundary? I wonder if 
> > that would be faster than binary search.

A probing hash lookup will incur a bounds check operation in java, which is not 
triggered at all for Arrays functions or anything which uses 
GetPrimitiveArrayCritical (in the same thread) which skips boundary checks.

Could be done, but for now readability by others is preferred till first 
integration.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java,
> >  line 101
> > <https://reviews.apache.org/r/11530/diff/1/?file=298368#file298368line101>
> >
> >     if inputs.length == 1 (which I think it often does in your tests), then 
> > I % 1 is always 0. So you are always loading up input vectors with all 0. 
> > Is there a reason for this? If so, please explain, or if not, consider a 
> > wider range of inputs.

It tests a case where .isRepeating is not true, but all items are initialized 
to the exact same value. The % was to repeat smaller batches to fill up the 
entire batch size.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java,
> >  line 109
> > <https://reviews.apache.org/r/11530/diff/1/?file=298368#file298368line109>
> >
> >     what does /*begin-macro*/ mean?

I used a home-made macro operation to generate all the test-cases to be 
identical with different UDFs

import sys, re;

data = sys.stdin.read();
macro = re.compile(r'begin-macro(.*)end-macro',re.M|re.S);
m = macro.search(data);
content = "\n".join(m.group(1).split("\n")[1:-1]);
occurrance = re.compile(r's\/([^/]*)\/([^/]*)\/g.*');
def expand(m):
        original = m.group(1);
        replacement = m.group(2);
        return "%s\n%s" % (m.group(0), 
  content.replace(original, replacement));
print re.sub(occurrance, expand, data);


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java,
> >  line 117
> > <https://reviews.apache.org/r/11530/diff/1/?file=298368#file298368line117>
> >
> >     I'm trying to standardize on using "batch" instead of "vrg" for 
> > batches. vrg is used in a lot of places but g stands for group not batch so 
> > it is confusing.

will change it, this was copied off the String UDF tests.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java,
> >  line 573
> > <https://reviews.apache.org/r/11530/diff/1/?file=298368#file298368line573>
> >
> >     here you know !(vrg.cols[in].noNulls || vrg.cols[in].isNull[i] == 
> > false) ==> 
> >     !noNulls and isNull[I] == true
> >     
> >     But you are checking for the value of isNull[I] for "in" in the assert 
> > which is a bit misleading. 
> >     
> >     Maybe test against true in the else branch?

The base test is Assert.assertEquals(vrg.cols[out].isNull[i], 
vrg.cols[in].isNull[i]); for both the sides of the if/else.

The logical extension being that vrg.cols[in].isNull[i] == true, I expect 
vrg.cols[out].isNull[i] == true.

It felt better to not simplify the logic and keep the assert as an invariant 
between the branches.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java,
> >  line 97
> > <https://reviews.apache.org/r/11530/diff/1/?file=298368#file298368line97>
> >
> >     Can you comment this function to explain how you are using long[] 
> > inputs? I think I understand but a comment would help.
> >     
> >

The input is repeated to fill up size times. I will add a comment.


> On May 30, 2013, 9:29 p.m., Eric Hanson wrote:
> > ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java,
> >  line 1
> > <https://reviews.apache.org/r/11530/diff/1/?file=298368#file298368line1>
> >
> >     Overall this is a good set of unit tests! It's quite comprehensive. 
> > Thanks.

Thanks, always makes sense to write more tests than impl, when reimplementing 
identical features.


- Gopal


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/11530/#review21203
-----------------------------------------------------------


On May 29, 2013, 11:36 p.m., Gopal V wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/11530/
> -----------------------------------------------------------
> 
> (Updated May 29, 2013, 11:36 p.m.)
> 
> 
> Review request for hive, Jitendra Pandey and Eric Hanson.
> 
> 
> Description
> -------
> 
> Timestamp UDFs for vectorized long nanosecond timestamps - all of them 
> convert timestamp into a corresponding long/int value (year, month, 
> week-of-year, day-of-month, hour, minute, second, unix-timestamp).
> 
> 
> This addresses bug HIVE-4608.
>     https://issues.apache.org/jira/browse/HIVE-4608
> 
> 
> Diffs
> -----
> 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFDayOfMonthLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFHourLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFMinuteLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFMonthLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFSecondLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFTimestampFieldLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFUnixTimeStampLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFWeekOfYearLong.java
>  PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/exec/vector/expressions/VectorUDFYearLong.java
>  PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorTimestampExpressions.java
>  PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/11530/diff/
> 
> 
> Testing
> -------
> 
> Unit tests included which compare each UDF against its non-vectorized one's 
> output, with random data and year boundary data (+1,0,-1).
> 
> 
> Thanks,
> 
> Gopal V
> 
>

Re: Review Request: Vectorized Timestamp functions for long nanosecond based timestamps

Reply via email to