[jira] [Commented] (PIG-3341) Improving performance of loading datetime values

pat chan (JIRA) Mon, 03 Jun 2013 17:40:47 -0700

    [ https://issues.apache.org/jira/browse/PIG-3341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673876#comment-13673876 ]


pat chan commented on PIG-3341:
-------------------------------

I looked through the docs for guidance on this topic and found the following 
in http://wiki.apache.org/pig/UDFManual:

<quote>
The first thing to decide is what to do with invalid data. This depends on the 
format of the data. If the data is of type bytearray it means that it has not 
yet been converted to its proper type. In this case, if the format of the data 
does not match the expected type, a null value should be returned. If, on the 
other hand, the input data is of another type, this means that the conversion 
has already happened and the data should be in the correct format. This is the 
case with our example and that's why it throws an error (line 16.) Note that 
WrappedIOException is a helper class to convert the actual exception to an 
IOException.

Also, note that lines 10-11 check if the input data is null or empty and if so 
returns null.
</quote>

If I'm reading this correctly, it says that if the type of the input doesn't 
match the signature of the UDF, a null should be returned. However, I get this:

  grunt> A = load 'o' as (a:bytearray);
  grunt> B = foreach A generate ToDate(a); dump B;
  2013-06-03 17:15:09,253 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1046: 
  <line 2, column 23> Multiple matching functions for org.apache.pig.builtin.ToDate with input schema: ({long}, {chararray}). Please use an explicit cast.

It also seems to say that if the types are right but the format is invalid, 
an error should be thrown. I just checked, and yes, I get an error (log 
below). However, this doesn't match Rohini's proposal to return a null 
instead. Also, as Dmitriy hinted, it's not philosophically consistent with 
loading behavior, where invalid values turn into nulls.

  2013-06-03 17:25:12,977 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
  2013-06-03 17:25:12,981 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias B


BTW, the note about lines 10-11 isn't quite right. The code in the example 
doesn't have a check for null and so a null would cause an exception.
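
To make the two behaviors discussed above concrete (an explicit null/empty 
check, and null rather than an error for a badly formatted value), here is a 
minimal sketch. It is not the ToDate code: the class and method names are 
hypothetical, and it uses java.time instead of the Joda-Time classes Pig uses, 
purely to stay self-contained. It also still pays for an exception 
internally, so it illustrates the null-returning semantics only, not the 
performance fix.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not part of Pig: shows "invalid input becomes null".
class LenientDateParser {

    // Returns the parsed value, or null when the input is null, empty,
    // or not a valid ISO-8601 date-time with offset.
    static OffsetDateTime parseOrNull(String s) {
        // Explicit null/empty check, so a null input never reaches the parser.
        if (s == null || s.isEmpty()) {
            return null;
        }
        try {
            return OffsetDateTime.parse(s);
        } catch (DateTimeParseException e) {
            // Badly formatted or out-of-range value: report it as null
            // instead of failing the whole job.
            return null;
        }
    }
}
```

With this sketch, a value such as 2000-99-01T00:00:00+01:00 (month 99) comes 
back as null, which matches what the loader does for invalid values.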

                
> Improving performance of loading datetime values
> ------------------------------------------------
>
>                 Key: PIG-3341
>                 URL: https://issues.apache.org/jira/browse/PIG-3341
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.11.1
>            Reporter: pat chan
>            Priority: Minor
>             Fix For: 0.12, 0.11.2
>
>
> The performance of loading datetime values can be improved by about 25% by 
> moving a single line in ToDate.java:
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
>       Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
> should become:
>     static Pattern pattern = Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");
>     public static DateTimeZone extractDateTimeZone(String dtStr) {
> There is no need to recompile the regular expression for every value. I'm not 
> sure if this function is ever called concurrently, but Pattern objects are 
> thread-safe anyway.
> As a test, I created a file of 10M timestamps:
>   for i in 1..10000000
>     puts '2000-01-01T00:00:00+23'
>   end
> I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is null; dump B;
> Before the change it took 160s.
> After the change, the script took 120s.
> ----------------
> Another performance improvement can be made for invalid datetime values. If a 
> datetime value is invalid, an exception is created and thrown, which is a 
> costly way to fail a validity check. To test the performance impact, I 
> created 10M invalid datetime values:
>   for i in 1..10000000
>     puts '2000-99-01T00:00:00+23'
>   end
> This test ran with the original code, in which the regex pattern is 
> recompiled for every value. I then ran this script:
>   grunt> A = load 'data' as (a:datetime); B = filter A by a is not null; dump B;
> The script took 190s.
> I understand this could be considered an edge case and might not be worth 
> changing. However, if there are use cases where invalid dates are part of 
> normal processing, then you might consider fixing this.
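
The regex change proposed in the quoted description can be sketched on its 
own as follows. This is only an illustration of compiling the pattern once: 
the class and method names are hypothetical, and the method returns the 
matched suffix as a String rather than a Joda-Time DateTimeZone so that the 
example has no dependencies. The pattern itself is the one from the 
description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone class, not the actual ToDate.java.
class ZoneSuffixExtractor {

    // Compiled once when the class is loaded. Pattern instances are
    // immutable and safe to share between threads.
    private static final Pattern ZONE_SUFFIX =
        Pattern.compile("(Z|(?<=(T[0-9\\.:]{0,12}))((\\+|-)\\d{2}(:?\\d{2})?))$");

    // Returns the trailing time zone designator ("Z", "+23", "+05:30", ...)
    // or null when the string has none.
    static String extractZoneSuffix(String dtStr) {
        // A Matcher holds per-match state, so a new one is created per call;
        // only the compiled Pattern is shared.
        Matcher matcher = ZONE_SUFFIX.matcher(dtStr);
        return matcher.find() ? matcher.group() : null;
    }
}
```

Only the Pattern is hoisted to a static field. The Matcher stays local 
because Matcher objects are not thread-safe.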

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
