pig-user  

Re: piggybank apachelogparser.DateExtractor problem

Johannes Rußek
Wed, 17 Mar 2010 08:12:22 -0700

Hi David!
Thanks a lot for your detailed answer, I will try to use your UDF :)
What bothers me, though, is that the DateExtractor apparently did work as expected at some point: the docs say to use it like that, and I found a bunch of blog posts that pass the format to the constructor.
Thanks anyway :)
Johannes

On 17.03.2010 10:07, David Vrensk wrote:
On Tue, Mar 16, 2010 at 19:58, Johannes Rußek <johannes.rus...@io-consulting.net> wrote:

Hello everybody,
I've been trying to use
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor from
the piggybank that comes with Pig 0.6.0, but I don't seem to be able to set the
output format.
Whatever I pass as the constructor argument:

DEFINE MyDateExtractor
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('HH:mm:ss');

I only ever get yyyy-MM-dd back.
However, when I change DEFAULT_OUTGOING_DATE_FORMAT in
main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/DateExtractor.java
to something like 'yyyy-MM-dd-HH', the output uses the new format.
Am I doing something wrong?

I don't think so.  I ran into the same problem a couple of weeks ago and
played around with the code, inserting some print/log statements.  It turns
out that the arguments are only used in the initial constructor calls, when
the Pig process is starting, but once Pig reaches the point where it would
use the UDF, it creates new DateExtractors without passing the arguments.
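
The effect described here can be reproduced in plain Java. This is only a minimal sketch of the observed behaviour, not Pig's actual instantiation code: `ConstructorArgsDemo`, `FormatUdf` and `instantiateWithoutArgs` are hypothetical names, and the reflection call merely stands in for whatever Pig does when it re-creates the UDF.

```java
import java.lang.reflect.Constructor;

public class ConstructorArgsDemo {

    /** Hypothetical stand-in for a UDF whose output format is set in the constructor. */
    public static class FormatUdf {
        static final String DEFAULT_FORMAT = "yyyy-MM-dd";
        private final String format;

        public FormatUdf() {
            this(DEFAULT_FORMAT);
        }

        public FormatUdf(String format) {
            this.format = format;
        }

        public String getFormat() {
            return format;
        }
    }

    /** Creates an instance through the no-argument constructor, as a framework might. */
    public static FormatUdf instantiateWithoutArgs(Class<? extends FormatUdf> cls) {
        try {
            Constructor<? extends FormatUdf> ctor = cls.getDeclaredConstructor();
            return ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not instantiate " + cls.getName(), e);
        }
    }

    public static void main(String[] args) {
        // Instance built with the argument, as at script start-up.
        FormatUdf atStartup = new FormatUdf("HH:mm:ss");
        // Instance built later without the argument: the format is lost.
        FormatUdf atExecution = instantiateWithoutArgs(FormatUdf.class);

        System.out.println(atStartup.getFormat());   // prints HH:mm:ss
        System.out.println(atExecution.getFormat()); // prints yyyy-MM-dd
    }
}
```

Any state that lives only in the instance created with arguments is gone in the second instance, which matches the yyyy-MM-dd output reported above.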

I found two ways around this:

1. Let the initial calls to the constructor store the format in a static
variable.  This is brittle.
2. Supply a date format with the actual calls.  This is what I ended up
doing (in my own DateExtractor that I created in my own UDF lib).  The end
result looks like this:

     public DateExtractor() {}

     @Override
     public String exec(Tuple input) throws IOException {
         if (input == null || input.size() == 0)
             return null;

         DateFormat incomingDateFormat = defaultIncomingDateFormat;
         DateFormat outgoingDateFormat = defaultOutgoingDateFormat;
         if (input.size() > 1) {
             outgoingDateFormat = new SimpleDateFormat((String)input.get(1));
             outgoingDateFormat.setTimeZone(gmt);
         }
         if (input.size() > 2) {
             incomingDateFormat = new SimpleDateFormat((String)input.get(2));
             incomingDateFormat.setTimeZone(gmt);
         }

         String str="";
         try {
             str = (String)input.get(0);
             Date date = incomingDateFormat.parse(str);
             return outgoingDateFormat.format(date);

         } catch (ParseException pe) {
             System.err.println("releware.pig.evaluation.DateExtractor: unable to parse date " + str);
             return null;
         } catch(Exception e){
             throw WrappedIOException.wrap("Caught exception processing input row ", e);
         }
     }

and is used like this (hopefully—I can't find the script that used it):

DEFINE Xdate com.com.releware.pig.evaluation.DateExtractor;

A = *load log;*
B = foreach A generate Xdate(stupid_timestamp, 'MM-dd');
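
The per-call approach can also be tried without a Pig installation. The following is a minimal sketch, not the UDF itself: `PerCallDateFormat` and `convert` are hypothetical names, a plain String array stands in for Pig's Tuple, and the two default formats are assumptions (an Apache access-log timestamp in, yyyy-MM-dd out).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class PerCallDateFormat {

    // Assumed defaults; adjust to whatever your log files actually contain.
    static final String DEFAULT_INCOMING = "dd/MMM/yyyy:HH:mm:ss Z";
    static final String DEFAULT_OUTGOING = "yyyy-MM-dd";
    static final TimeZone GMT = TimeZone.getTimeZone("GMT");

    /**
     * input[0] is the timestamp, input[1] an optional outgoing format and
     * input[2] an optional incoming format, mirroring the argument order
     * of the exec method above.
     */
    public static String convert(String... input) {
        if (input == null || input.length == 0 || input[0] == null) {
            return null;
        }
        String outgoing = input.length > 1 ? input[1] : DEFAULT_OUTGOING;
        String incoming = input.length > 2 ? input[2] : DEFAULT_INCOMING;

        // SimpleDateFormat is not thread-safe, so new instances are built per call.
        SimpleDateFormat in = new SimpleDateFormat(incoming, Locale.ENGLISH);
        SimpleDateFormat out = new SimpleDateFormat(outgoing, Locale.ENGLISH);
        in.setTimeZone(GMT);
        out.setTimeZone(GMT);

        try {
            Date date = in.parse(input[0]);
            return out.format(date);
        } catch (ParseException pe) {
            // Unparseable timestamps yield null, as in the UDF above.
            return null;
        }
    }

    public static void main(String[] args) {
        String ts = "17/Mar/2010:08:12:22 -0700";
        System.out.println(convert(ts));             // prints 2010-03-17
        System.out.println(convert(ts, "HH:mm:ss")); // prints 15:12:22
        System.out.println(convert(ts, "MM-dd"));    // prints 03-17
    }
}
```

Because the format travels with every call, nothing depends on which constructor created the instance. The price is two new SimpleDateFormat objects per row, which may matter on large inputs.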

Hope this helps!

/David