pig-user  

Re: piggybank apachelogparser.DateExtractor problem

Dmitriy Ryaboy
Wed, 17 Mar 2010 08:31:25 -0700

Yeah that's weird. Especially the wrong constructor being called. Could you
open a ticket please?

On Wed, Mar 17, 2010 at 8:11 AM, Johannes Rußek <
johannes.rus...@io-consulting.net> wrote:

> Hi David!
> Thanks a lot for your detailed answer, I will try to use your UDF :)
> What bothers me though is that the DateExtractor appears to have worked
> like we expected at some point in time, since the docs say to use it like
> that and I could find a bunch of blog posts using it with the format in the
> constructor.
> Thanks anyway :)
> Johannes
>
> On 17.03.2010 10:07, David Vrensk wrote:
>
>  On Tue, Mar 16, 2010 at 19:58, Johannes Rußek<
>> johannes.rus...@io-consulting.net>  wrote:
>>
>>
>>
>>> Hello everybody,
>>> I've been trying to use
>>> org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
>>> from piggybank that comes with Pig 0.6.0, but I don't seem to be able to
>>> set the output format.
>>> Whatever I use as the argument in the construct:
>>>
>>> DEFINE MyDateExtractor
>>> org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('HH:mm:ss');
>>>
>>> I only ever get yyyy-MM-dd back.
>>> However, when I change DEFAULT_OUTGOING_DATE_FORMAT in
>>> main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/DateExtractor.java
>>> to something like 'yyyy-MM-dd-HH' it is able to output the right format.
>>> Am I doing something wrong?
>>>
>>>
>>>
>> I don't think so.  I ran into the same problem a couple of weeks ago, and
>> played around with the code, inserting some print/log statements.  It turns
>> out that the arguments are only used in the initial constructor calls, when
>> the Pig process is starting, but once Pig reaches the point where it would
>> use the UDF, it creates new DateExtractors without passing the arguments.
>>
>> I found two ways around this:
>>
>> 1. Let the initial calls to the constructor store the format in a static
>> variable.  This is brittle.
>> 2. Supply a date format with the actual calls.  This is what I ended up
>> doing (in my own DateExtractor that I created in my own UDF lib).  The end
>> result looks like this:
>>
>>     public DateExtractor() {}
>>
>>     @Override
>>     public String exec(Tuple input) throws IOException {
>>         if (input == null || input.size() == 0)
>>             return null;
>>
>>         DateFormat incomingDateFormat = defaultIncomingDateFormat;
>>         DateFormat outgoingDateFormat = defaultOutgoingDateFormat;
>>         if (input.size() > 1) {
>>             outgoingDateFormat = new SimpleDateFormat((String)input.get(1));
>>             outgoingDateFormat.setTimeZone(gmt);
>>         }
>>         if (input.size() > 2) {
>>             incomingDateFormat = new SimpleDateFormat((String)input.get(2));
>>             incomingDateFormat.setTimeZone(gmt);
>>         }
>>
>>         String str="";
>>         try {
>>             str = (String)input.get(0);
>>             Date date = incomingDateFormat.parse(str);
>>             return outgoingDateFormat.format(date);
>>
>>         } catch (ParseException pe) {
>>             System.err.println("releware.pig.evaluation.DateExtractor: unable to parse date "+str);
>>             return null;
>>         } catch(Exception e){
>>             throw WrappedIOException.wrap("Caught exception processing input row ", e);
>>         }
>>         }
>>     }
>>
>> and is used like this (hopefully—I can't find the script that used it):
>>
>> DEFINE Xdate com.com.releware.pig.evaluation.DateExtractor;
>>
>> A = load log;
>> B = foreach A generate Xdate(stupid_timestamp, 'MM-dd');
>>
>> Hope this helps!
>>
>> /David
>>
>>
>>
>
>
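
The per-call approach David describes (passing the date format along with
each value instead of through the constructor) can be tried outside Pig. What
follows is only a minimal sketch under stated assumptions, not the piggybank
code and not David's UDF: the class name DateReformatSketch, the reformat
method, the two default patterns, and the use of Locale.US are all
hypothetical choices made for illustration. It has no Pig dependency, so it
shows the SimpleDateFormat handling only.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical stand-alone class; not part of Pig or piggybank.
final class DateReformatSketch {
    // Assumed defaults, picked to suit an Apache access-log timestamp.
    static final String DEFAULT_INCOMING = "dd/MMM/yyyy:HH:mm:ss Z";
    static final String DEFAULT_OUTGOING = "yyyy-MM-dd";

    // Builds a fresh formatter per call; SimpleDateFormat is not thread-safe.
    private static SimpleDateFormat gmtFormat(String pattern) {
        // Locale.US is an assumption so that month names like "Mar" parse.
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
        format.setTimeZone(TimeZone.getTimeZone("GMT"));
        return format;
    }

    /**
     * Parses timestamp with the incoming pattern and renders it with the
     * outgoing pattern in GMT. A null pattern falls back to the default.
     * Returns null when the timestamp is null or cannot be parsed.
     */
    static String reformat(String timestamp, String outgoing, String incoming) {
        if (timestamp == null) {
            return null;
        }
        SimpleDateFormat in = gmtFormat(incoming != null ? incoming : DEFAULT_INCOMING);
        SimpleDateFormat out = gmtFormat(outgoing != null ? outgoing : DEFAULT_OUTGOING);
        try {
            Date date = in.parse(timestamp);
            return out.format(date);
        } catch (ParseException pe) {
            return null;
        }
    }

    public static void main(String[] args) {
        String ts = "17/Mar/2010:08:31:25 -0700";
        System.out.println(reformat(ts, null, null));       // default output pattern
        System.out.println(reformat(ts, "HH:mm:ss", null)); // format chosen per call
    }
}
```

Because the format arrives with every call, nothing depends on which
constructor gets invoked, which is the problem discussed in this thread. The
cost is building a new SimpleDateFormat per row; caching one formatter per
pattern would be a possible refinement.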