pig-user  

Re: piggybank apachelogparser.DateExtractor problem

Johannes Rußek
Wed, 17 Mar 2010 09:37:13 -0700

Hi Dmitriy,
where would I open the ticket?
Here: http://issues.apache.org/jira/browse/PIG? Should I mail the developer list first?
Sorry, I'm new to the Apache project stuff :)
Johannes


On 17.03.2010 16:30, Dmitriy Ryaboy wrote:
Yeah that's weird. Especially the wrong constructor being called. Could you
open a ticket please?

On Wed, Mar 17, 2010 at 8:11 AM, Johannes Rußek <johannes.rus...@io-consulting.net> wrote:

Hi David!
Thanks a lot for your detailed answer, I will try to use your UDF :)
What bothers me, though, is that the DateExtractor appears to have worked
as we expected at some point, since the docs say to use it like that and I
could find a bunch of blog posts using it with the format in the
constructor.
Thanks anyway :)
Johannes

On 17.03.2010 10:07, David Vrensk wrote:

On Tue, Mar 16, 2010 at 19:58, Johannes Rußek <johannes.rus...@io-consulting.net> wrote:



Hello everybody,
I've been trying to use
org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor
from the piggybank that comes with Pig 0.6.0, but I don't seem to be able
to set the output format. Whatever I use as the argument in the constructor:

DEFINE MyDateExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('HH:mm:ss');

I only ever get yyyy-MM-dd back.
However, when I change DEFAULT_OUTGOING_DATE_FORMAT in

main/java/org/apache/pig/piggybank/evaluation/util/apachelogparser/DateExtractor.java
to something like 'yyyy-MM-dd-HH', it outputs the right format.
Am I doing something wrong?



I don't think so.  I ran into the same problem a couple of weeks ago, and
played around with the code, inserting some print/log statements.  It turns
out that the arguments are only used in the initial constructor calls, when
the Pig process is starting, but once Pig reaches the point where it would
use the UDF, it creates new DateExtractors without passing the arguments.

I found two ways around this:

1. Let the initial calls to the constructor store the format in a static
variable.  This is brittle.
2. Supply a date format with the actual calls.  This is what I ended up
doing (in my own DateExtractor that I created in my own UDF lib).  The end
result looks like this:

     public DateExtractor() {}

     @Override
     public String exec(Tuple input) throws IOException {
         if (input == null || input.size() == 0)
             return null;

         DateFormat incomingDateFormat = defaultIncomingDateFormat;
         DateFormat outgoingDateFormat = defaultOutgoingDateFormat;
         if (input.size() > 1) {
             outgoingDateFormat = new SimpleDateFormat((String)input.get(1));
             outgoingDateFormat.setTimeZone(gmt);
         }
         if (input.size() > 2) {
             incomingDateFormat = new SimpleDateFormat((String)input.get(2));
             incomingDateFormat.setTimeZone(gmt);
         }

         String str="";
         try {
             str = (String)input.get(0);
             Date date = incomingDateFormat.parse(str);
             return outgoingDateFormat.format(date);

         } catch (ParseException pe) {
             System.err.println("releware.pig.evaluation.DateExtractor: unable to parse date "+str);
             return null;
         } catch(Exception e){
             throw WrappedIOException.wrap("Caught exception processing input row ", e);
         }
     }

and is used like this (hopefully—I can't find the script that used it):

DEFINE Xdate com.com.releware.pig.evaluation.DateExtractor;

A = load log;
B = foreach A generate Xdate(A.stupid_timestamp, 'MM-dd');
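
For anyone who wants to try the per-call format idea outside of Pig, here is a
minimal standalone sketch of the same logic. The class name DateReformatSketch
and the reformat() helper are made up for illustration and are not part of
piggybank; the incoming pattern is an assumption (the usual Apache access-log
timestamp), and only the yyyy-MM-dd default comes from this thread.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper for illustration only; not part of piggybank.
final class DateReformatSketch {
    // Assumption: timestamps look like the usual Apache access-log format.
    static final String INCOMING_PATTERN = "dd/MMM/yyyy:HH:mm:ss Z";
    // Default output mentioned in the thread.
    static final String DEFAULT_OUTGOING_PATTERN = "yyyy-MM-dd";
    private static final TimeZone GMT = TimeZone.getTimeZone("GMT");

    // The output pattern is passed with every call, so nothing depends on
    // state set up in a constructor. Returns null if the input cannot be parsed.
    static String reformat(String raw, String outgoingPattern) {
        if (raw == null) {
            return null;
        }
        String pattern = outgoingPattern != null ? outgoingPattern : DEFAULT_OUTGOING_PATTERN;
        // SimpleDateFormat is not thread-safe, so create fresh instances per call.
        DateFormat incoming = new SimpleDateFormat(INCOMING_PATTERN, Locale.ENGLISH);
        DateFormat outgoing = new SimpleDateFormat(pattern, Locale.ENGLISH);
        incoming.setTimeZone(GMT);
        outgoing.setTimeZone(GMT);
        try {
            Date date = incoming.parse(raw);
            return outgoing.format(date);
        } catch (ParseException pe) {
            return null;
        }
    }

    public static void main(String[] args) {
        String ts = "16/Mar/2010:19:58:00 +0100";
        System.out.println(reformat(ts, "HH:mm:ss")); // time of day, in GMT
        System.out.println(reformat(ts, null));       // falls back to the default pattern
    }
}
```

Because the format travels with each call, it does not matter which
constructor was used to create the instance.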

Hope this helps!

/David