[ 
https://issues.apache.org/jira/browse/PIG-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658348#comment-14658348
 ] 

Niels Basjes commented on PIG-4639:
-----------------------------------

A simple first test, after building Pig with this patch, is to run this Pig 
script locally:

{code}
REGISTER ./contrib/piggybank/java/piggybank.jar

Example =
    LOAD 'test.pig'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader('combined');
DUMP Example;
{code}

The output is an example of how you can define a working parser that gives you 
the fields you want.
This output is actual working Pig code that parses an Apache HTTPD access log 
file in the given format into all of the requested fields.
In this example the output is a single tuple containing a single string, which 
looks like this:
{code}
(


Clicks =
    LOAD 'access.log'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
        'combined',

        'IP:connection.client.host',
        'NUMBER:connection.client.logname',
        'STRING:connection.client.user',
        'TIME.STAMP:request.receive.time',
        'TIME.DAY:request.receive.time.day',
        'TIME.MONTHNAME:request.receive.time.monthname',
        'TIME.MONTH:request.receive.time.month',
        'TIME.WEEK:request.receive.time.weekofweekyear',
        'TIME.YEAR:request.receive.time.weekyear',
        'TIME.YEAR:request.receive.time.year',
        'TIME.HOUR:request.receive.time.hour',
        'TIME.MINUTE:request.receive.time.minute',
        'TIME.SECOND:request.receive.time.second',
        'TIME.MILLISECOND:request.receive.time.millisecond',
        'TIME.ZONE:request.receive.time.timezone',
        'TIME.EPOCH:request.receive.time.epoch',
        'TIME.DAY:request.receive.time.day_utc',
        'TIME.MONTHNAME:request.receive.time.monthname_utc',
        'TIME.MONTH:request.receive.time.month_utc',
        'TIME.WEEK:request.receive.time.weekofweekyear_utc',
        'TIME.YEAR:request.receive.time.weekyear_utc',
        'TIME.YEAR:request.receive.time.year_utc',
        'TIME.HOUR:request.receive.time.hour_utc',
        'TIME.MINUTE:request.receive.time.minute_utc',
        'TIME.SECOND:request.receive.time.second_utc',
        'TIME.MILLISECOND:request.receive.time.millisecond_utc',
        'HTTP.FIRSTLINE:request.firstline',
        'HTTP.METHOD:request.firstline.method',
        'HTTP.URI:request.firstline.uri',
        'HTTP.PROTOCOL:request.firstline.uri.protocol',
        'HTTP.USERINFO:request.firstline.uri.userinfo',
        'HTTP.HOST:request.firstline.uri.host',
        'HTTP.PORT:request.firstline.uri.port',
        'HTTP.PATH:request.firstline.uri.path',
        'HTTP.QUERYSTRING:request.firstline.uri.query',
        'STRING:request.firstline.uri.query.*',  -- If you only want a single field replace * with name and change type to chararray',
        'HTTP.REF:request.firstline.uri.ref',
        'HTTP.PROTOCOL:request.firstline.protocol',
        'HTTP.PROTOCOL.VERSION:request.firstline.protocol.version',
        'STRING:request.status.last',
        'BYTES:response.body.bytesclf',
        'HTTP.URI:request.referer',
        'HTTP.PROTOCOL:request.referer.protocol',
        'HTTP.USERINFO:request.referer.userinfo',
        'HTTP.HOST:request.referer.host',
        'HTTP.PORT:request.referer.port',
        'HTTP.PATH:request.referer.path',
        'HTTP.QUERYSTRING:request.referer.query',
        'STRING:request.referer.query.*',  -- If you only want a single field replace * with name and change type to chararray',
        'HTTP.REF:request.referer.ref',
        'HTTP.USERAGENT:request.user-agent')
    AS (
        connection_client_host:chararray,
        connection_client_logname:long,
        connection_client_user:chararray,
        request_receive_time:chararray,
        request_receive_time_day:long,
        request_receive_time_monthname:chararray,
        request_receive_time_month:long,
        request_receive_time_weekofweekyear:long,
        request_receive_time_weekyear:long,
        request_receive_time_year:long,
        request_receive_time_hour:long,
        request_receive_time_minute:long,
        request_receive_time_second:long,
        request_receive_time_millisecond:long,
        request_receive_time_timezone:chararray,
        request_receive_time_epoch:long,
        request_receive_time_day_utc:long,
        request_receive_time_monthname_utc:chararray,
        request_receive_time_month_utc:long,
        request_receive_time_weekofweekyear_utc:long,
        request_receive_time_weekyear_utc:long,
        request_receive_time_year_utc:long,
        request_receive_time_hour_utc:long,
        request_receive_time_minute_utc:long,
        request_receive_time_second_utc:long,
        request_receive_time_millisecond_utc:long,
        request_firstline:chararray,
        request_firstline_method:chararray,
        request_firstline_uri:chararray,
        request_firstline_uri_protocol:chararray,
        request_firstline_uri_userinfo:chararray,
        request_firstline_uri_host:chararray,
        request_firstline_uri_port:long,
        request_firstline_uri_path:chararray,
        request_firstline_uri_query:chararray,
        request_firstline_uri_query__:map[],  -- If you only want a single field replace * with name and change type to chararray,
        request_firstline_uri_ref:chararray,
        request_firstline_protocol:chararray,
        request_firstline_protocol_version:chararray,
        request_status_last:chararray,
        response_body_bytesclf:long,
        request_referer:chararray,
        request_referer_protocol:chararray,
        request_referer_userinfo:chararray,
        request_referer_host:chararray,
        request_referer_port:long,
        request_referer_path:chararray,
        request_referer_query:chararray,
        request_referer_query__:map[],  -- If you only want a single field replace * with name and change type to chararray,
        request_referer_ref:chararray,
        request_user_agent:chararray);



)

{code}
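
As a rough sketch of how this generated output can be trimmed down: assuming 
you only need the client IP, the request path and one query string parameter, 
the loader call could look something like the script below. The field 
specifiers are taken from the output above. The parameter name 'foo', the 
file name 'access.log' and the alias names in the AS clause are placeholders 
chosen for illustration, and this trimmed version has not been run against 
the patch.

{code}
REGISTER ./contrib/piggybank/java/piggybank.jar

-- Sketch only: 'foo' is a hypothetical query string parameter name.
-- Following the hint in the generated output, the '*' wildcard is replaced
-- by the parameter name and the type becomes chararray instead of map[].
Clicks =
    LOAD 'access.log'
    USING org.apache.pig.piggybank.storage.apachelog.LogFormatLoader(
        'combined',
        'IP:connection.client.host',
        'HTTP.PATH:request.firstline.uri.path',
        'STRING:request.firstline.uri.query.foo')
    AS (
        connection_client_host:chararray,
        request_firstline_uri_path:chararray,
        request_firstline_uri_query_foo:chararray);

DUMP Clicks;
{code}

The fields in the AS clause are listed in the same order as the requested 
fields, as in the generated output.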


> Add better parser for Apache HTTPD access log.
> ----------------------------------------------
>
>                 Key: PIG-4639
>                 URL: https://issues.apache.org/jira/browse/PIG-4639
>             Project: Pig
>          Issue Type: New Feature
>          Components: piggybank
>    Affects Versions: 0.15.0
>            Reporter: Niels Basjes
>            Assignee: Niels Basjes
>             Fix For: 0.16.0
>
>         Attachments: PIG-4639-20150723-classnotfound.patch, 
> PIG-4639-20150725.patch, PIG-4639-20150805-1247.patch
>
>
> Currently there are two parsers for Apache HTTPD access log files in piggybank 
> that only allow parsing the 'combined' and 'common' logformats. These two 
> also only parse the 'basics'.
> This is a proposed patch to add the existing 
> https://github.com/nielsbasjes/logparser (Apache 2.0 license) as an 'out of 
> the box' parser to piggybank. 
> This parser parses the logfile using the LogFormat specification used to 
> write it. Almost all LogFormat specifiers are supported, and as such it adds 
> easy parsing capabilities for (almost) all custom logformats used in 
> production scenarios. 
> This parser also goes much deeper in the sense that it allows extracting 
> things like the value of a cookie or the value of a query string parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
