[ 
https://issues.apache.org/jira/browse/DRILL-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993914#comment-14993914
 ] 

Jim Scott commented on DRILL-3423:
----------------------------------

To start fresh on this topic: my understanding of the capabilities of this 
parser grew tenfold while building this implementation. I do feel that it is 
already built in such a way that it will deliver the most flexibility and power 
to the user. That being said, I'm open to discussing the whys and why-nots of 
this, because I think this is one of the most important file formats we can add 
to Drill.

On to the present...
I think we will be best served by working from concrete examples, with enough 
description that we stay specific and do not speak in generalities.

As of right now, with this logFormat: "%h %t \"%r\" %>s %b \"%{Referer}i\""
this query: select * from dfs.`jimslogfile.log`
with NO user configuration

Drill will yield these fields to the user:
TIME_STAMP:request_receive_time
TIME_DAY:request_receive_time_day
TIME_MONTHNAME:request_receive_time_monthname
TIME_MONTH:request_receive_time_month
TIME_WEEK:request_receive_time_weekofweekyear
TIME_YEAR:request_receive_time_weekyear
TIME_YEAR:request_receive_time_year
TIME_HOUR:request_receive_time_hour
TIME_MINUTE:request_receive_time_minute
TIME_SECOND:request_receive_time_second
TIME_MILLISECOND:request_receive_time_millisecond
TIME_ZONE:request_receive_time_timezone
TIME_EPOCH:request_receive_time_epoch
TIME_DAY:request_receive_time_day_utc
TIME_MONTHNAME:request_receive_time_monthname_utc
TIME_MONTH:request_receive_time_month_utc
TIME_WEEK:request_receive_time_weekofweekyear_utc
TIME_YEAR:request_receive_time_weekyear_utc
TIME_YEAR:request_receive_time_year_utc
TIME_HOUR:request_receive_time_hour_utc
TIME_MINUTE:request_receive_time_minute_utc
TIME_SECOND:request_receive_time_second_utc
TIME_MILLISECOND:request_receive_time_millisecond_utc
IP:connection_client_host
HTTP_FIRSTLINE:request_firstline
HTTP_METHOD:request_firstline_method
HTTP_URI:request_firstline_uri
HTTP_PROTOCOL:request_firstline_uri_protocol
HTTP_USERINFO:request_firstline_uri_userinfo
HTTP_HOST:request_firstline_uri_host
HTTP_PORT:request_firstline_uri_port
HTTP_PATH:request_firstline_uri_path
HTTP_QUERYSTRING:request_firstline_uri_query
STRING:request_firstline_uri_query:map
HTTP_REF:request_firstline_uri_ref
HTTP_PROTOCOL:request_firstline_protocol
HTTP_PROTOCOL_VERSION:request_firstline_protocol_version
HTTP_URI:request_referer
HTTP_PROTOCOL:request_referer_protocol
HTTP_USERINFO:request_referer_userinfo
HTTP_HOST:request_referer_host
HTTP_PORT:request_referer_port
HTTP_PATH:request_referer_path
HTTP_QUERYSTRING:request_referer_query
STRING:request_referer_query:map
HTTP_REF:request_referer_ref
STRING:request_status_last
BYTES:response_body_bytesclf
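
To make the mapping from the logFormat to a few of these fields concrete, here 
is a rough sketch in Python. It is only an illustration: the regular expression 
is hand-written for this one logFormat and is not how the logparser library 
works internally, the sample log line is made up, and parse_line is a 
hypothetical helper. Only a handful of the fields listed above are filled in.

```python
import re

# Hand-written pattern for: %h %t "%r" %>s %b "%{Referer}i"
# (an assumption for illustration; the real parser derives this from the logFormat)
LINE_RE = re.compile(
    r'(?P<host>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<firstline>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)"'
)

# Made-up sample line, not taken from a real log.
SAMPLE = ('127.0.0.1 [10/Oct/2015:13:55:36 -0700] '
          '"GET /index.html?a=1 HTTP/1.1" 200 2326 '
          '"http://example.com/start.html"')


def parse_line(line):
    """Return a flat dict keyed by a few of the field names above, or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    row = {
        "IP:connection_client_host": m.group("host"),
        "TIME_STAMP:request_receive_time": m.group("time"),
        "HTTP_FIRSTLINE:request_firstline": m.group("firstline"),
        "STRING:request_status_last": m.group("status"),
        "BYTES:response_body_bytesclf": m.group("bytes"),
        "HTTP_URI:request_referer": m.group("referer"),
    }
    # Split the first line only when it has the usual three parts.
    parts = m.group("firstline").split(" ")
    if len(parts) == 3:
        row["HTTP_METHOD:request_firstline_method"] = parts[0]
        row["HTTP_URI:request_firstline_uri"] = parts[1]
    return row


if __name__ == "__main__":
    for name, value in parse_line(SAMPLE).items():
        print(name, "=", value)
```

The point of the sketch is that every value lands in its own flat, 
predictably named column, which is what makes the field list easy to browse.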

I believe the benefit of this is that the user will be able to easily refine 
and figure out what they are looking for, which will allow them to then 
optimize the parsing by adding specific fields to the configuration file. This 
could be copy & paste style if we change the plugin configuration to use _ 
instead of . as mentioned in my previous comment. I would be good with that 
change, as it would certainly make things easier for the user and reduce the 
likelihood of configuration mistakes.
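
As a small illustration of the . versus _ difference, here is a sketch in 
Python. It assumes the dotted form looks like "HTTP.URI:request.referer" and 
that the column name Drill shows is the same string with every . replaced by _ 
(which matches the field list above); to_column_name is a hypothetical helper, 
not part of the plugin.

```python
def to_column_name(dotted_field):
    """Turn a dotted field name into the underscore form shown to the user.

    Assumption: the only difference between the two forms is '.' versus '_'.
    """
    return dotted_field.replace(".", "_")


if __name__ == "__main__":
    # Dotted names as they might appear in the configuration today (assumed).
    configured = [
        "HTTP.URI:request.referer",
        "IP:connection.client.host",
        "TIME.STAMP:request.receive.time",
    ]
    for field in configured:
        print(field, "->", to_column_name(field))
```

If the configuration accepted the underscore form directly, this translation 
step would disappear and the user could paste the column names unchanged.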

Removing the first part of the field name (for example "HTTP_URI:") would clean 
up the names, but while that is cleaner, it doesn't simplify the user 
experience in my opinion. I also don't believe that allowing a user to map 
those fields to different names improves the user experience, and I would 
actually argue that it would detract from it by introducing the possibility of 
confusion or mistakes (we know users mess up configurations all the time, and 
these are difficult for beginners to troubleshoot).

With respect to nesting the data in maps, I think the only time we would want 
to do that is when there is a wildcard the user is trying to capture. The 
reason is that when I think about parsing a log line in any application, I 
expect to get a flat, tabular result set. I wouldn't expect complex data 
structures to come back.
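
A quick sketch of what I mean, in Python. The field names come from the list 
above; the path and query string values are made up, to_row is a hypothetical 
helper, and the standard library's parse_qsl stands in for whatever the parser 
really does with the wildcard.

```python
from urllib.parse import parse_qsl


def to_row(uri_path, query_string):
    """Build one result row: flat columns, plus a map only for the wildcard."""
    return {
        # Ordinary fields stay flat, one scalar value per column.
        "HTTP_PATH:request_firstline_uri_path": uri_path,
        "HTTP_QUERYSTRING:request_firstline_uri_query": query_string,
        # The wildcard capture is the one place a map makes sense, because
        # the set of query parameter names is not known ahead of time.
        "STRING:request_firstline_uri_query:map": dict(parse_qsl(query_string)),
    }


if __name__ == "__main__":
    row = to_row("/index.html", "a=1&b=two")
    for name, value in row.items():
        print(name, "=", value)
```

Everything except the wildcard column is a plain scalar, so the result still 
reads as a table.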


> Add New HTTPD format plugin
> ---------------------------
>
>                 Key: DRILL-3423
>                 URL: https://issues.apache.org/jira/browse/DRILL-3423
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Storage - Other
>            Reporter: Jacques Nadeau
>            Assignee: Jim Scott
>             Fix For: 1.4.0
>
>
> Add an HTTPD logparser based format plugin.  The author has been kind enough 
> to move the logparser project to be released under the Apache License.  Can 
> find it here:
> <dependency>
>     <groupId>nl.basjes.parse.httpdlog</groupId>
>     <artifactId>httpdlog-parser</artifactId>
>     <version>2.0</version>
> </dependency>
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
