cgivre opened a new pull request #2112:
URL: https://github.com/apache/drill/pull/2112


   # [DRILL-7534](https://issues.apache.org/jira/browse/DRILL-7534): Convert 
HTTPD Format Plugin to EVF
   
   ## Description
   This PR updates the HTTPD format plugin to use the Enhanced Vector Framework 
(EVF).  Users should notice only a few changes:
   
   1. A new configuration option `maxErrors` has been added which will allow a 
user to tune how fault tolerant they want Drill to be when reading log files. 
   2.  Two new implicit fields have been added, `_raw` and `_matched`.  They 
are described in the docs below. 
   3.  The plugin now includes a limit pushdown which significantly improves 
query times for queries with limits.
   4.  The plugin code is now in the `contrib` folder.
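   
   To illustrate item 3, a query of the following shape is the kind the limit 
pushdown is meant to speed up. This is only a sketch: `<data>` is a placeholder 
for the path to a log file, not a real table name.
   
   ```sql
   -- <data> is a placeholder for a web server log file readable by this plugin
   SELECT *
   FROM <data>
   LIMIT 10
   ```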
   
   In addition, this PR updates the associated User Agent parsing functions 
with the latest version of the underlying libraries.
   
   ## Documentation
   # Web Server Log Format Plugin (HTTPD)
   This plugin enables Drill to read and query httpd (Apache Web Server) and 
nginx logs natively. It is built on the logparser library by [Niels 
Basjes](https://github.com/nielsbasjes), which is available at 
https://github.com/nielsbasjes/logparser.
   
   ## Configuration
   The plugin has four configuration fields that control how Drill reads web 
server logs:
   * **`logFormat`**:  The log format string is the format string found in your 
web server configuration.
   * **`timestampFormat`**:  The format of time stamps in your log files.
   * **`extensions`**:  The file extension of your web server logs.
   * **`maxErrors`**:  Sets the plugin's error tolerance. When set to any value 
less than `0`, Drill will ignore all errors. 
   
   ```json
   "httpd" : {
     "type" : "httpd",
     "logFormat" : "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\"",
     "timestampFormat" : "dd/MMM/yyyy:HH:mm:ss ZZ",
     "maxErrors": 0
   }
   ```
   
   ### Implicit Columns
   Data queried by this plugin will return two implicit columns:
   
   * **`_raw`**:  Returns the raw, unparsed log line.
   * **`_matched`**:  Returns `true` or `false` depending on whether the line 
matched the configured log format.
   
   Thus, if you wanted to see which lines in your log file were not matching 
the config, you could use the following query:
   
   ```sql
   SELECT _raw
   FROM <data>
   WHERE _matched = false
   ```
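   
   A related query, sketched here as an illustration rather than taken from the 
plugin's tests, counts matched and unmatched lines, which may help when choosing 
a `maxErrors` value. `<data>` is the same placeholder as above, and `line_count` 
is an arbitrary alias.
   
   ```sql
   -- Count how many lines did and did not match the configured log format
   SELECT _matched, COUNT(*) AS line_count
   FROM <data>
   GROUP BY _matched
   ```
   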
   ## Testing
   Added unit tests for this plugin.  Also ran all unit tests for the 
`parse_user_agent()` UDF. 
   


