[GitHub] [drill] cgivre commented on a change in pull request #2089: Drill-7751: Add Storage Plugin for Splunk

GitBox Tue, 02 Feb 2021 21:09:03 -0800


cgivre commented on a change in pull request #2089:
URL: https://github.com/apache/drill/pull/2089#discussion_r569134942




##########
File path: contrib/storage-splunk/README.md
##########
@@ -0,0 +1,152 @@
+# Drill Connector for Splunk
+This plugin enables Drill to query Splunk. 
+
+## Configuration
+To connect Drill to Splunk, create a new storage plugin with the following 
configuration:
+
+Splunk uses port `8089` for its management (REST API) interface by default.  
This port must be open for Drill to query Splunk. 
+
+```json
+{
+   "type":"splunk",
+   "username": "admin",
+   "password": "changeme",
+   "hostname": "localhost",
+   "port": 8089,
+   "earliestTime": "-14d",
+   "latestTime": "now",
+   "enabled": false
+}
+```
+
+## Understanding Splunk's Data Model
+Splunk's primary use case is analyzing event logs with a timestamp. As such, 
data is indexed by the timestamp, with the most recent data being indexed 
first.  By default, Splunk
+ will sort the data in reverse chronological order.  Large Splunk 
installations will put older data into buckets of hot, warm and cold storage 
with the "cold" storage on the
+  slowest and cheapest disks.
+  
+With this understood, it is **very** important to put time boundaries on your 
Splunk queries. The Drill plugin allows you to set default values in the 
configuration such that every
+ query you run will be bounded by these defaults.  Alternatively, you can 
set the time boundaries at query time.  In either case, you will achieve the 
best performance when
+  you are asking Splunk for the smallest amount of data possible.
+  
+## Understanding Drill's Data Model with Splunk
+Drill treats Splunk indexes as tables. Splunk's access model does not restrict 
access to the catalog, but does restrict access to the actual data. It is 
therefore possible that you can
+ see the names of indexes to which you do not have access.  You can view the 
list of available indexes with a `SHOW TABLES IN splunk` query.
+  
+```
+apache drill> SHOW TABLES IN splunk;
++--------------+----------------+
+| TABLE_SCHEMA |   TABLE_NAME   |
++--------------+----------------+
+| splunk       | summary        |
+| splunk       | splunklogger   |
+| splunk       | _thefishbucket |
+| splunk       | _audit         |
+| splunk       | _internal      |
+| splunk       | _introspection |
+| splunk       | main           |
+| splunk       | history        |
+| splunk       | _telemetry     |
++--------------+----------------+
+9 rows selected (0.304 seconds)
+```
+To query Splunk from Drill, use the following format: 
+```sql
+SELECT <fields>
+FROM splunk.<index>
+```
+  
+ ## Bounding Your Queries
+  When you learn to query Splunk via its interface, the first thing you 
learn is to bound your queries so that they look at the shortest time 
span possible. When using
+   Drill to query Splunk, it is advisable to do the same thing, and Drill 
offers two ways to accomplish this: via the configuration and at query time.
+   
+  ### Bounding Your Queries at Query Time
+  The easiest way to bound your query is to do so at query time via special 
filters in the `WHERE` clause. There are two special fields, `earliestTime` and 
`latestTime`, which can
+   be set to bound the query. If they are not set, the query will be bounded 
by the defaults set in the configuration.
+   
+   You can use any of the time formats specified in the Splunk documentation 
here:   
+  
https://docs.splunk.com/Documentation/Splunk/8.0.3/SearchReference/SearchTimeModifiers
+  
+  So if you wanted to see your data for the last 15 minutes, you could execute 
the following query:
+
+```sql
+SELECT <fields>
+FROM splunk.<index>
+WHERE earliestTime='-15m' AND latestTime='now'
+```
+The variables set in a query override the defaults from the configuration. 
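+
+Splunk's relative time modifiers also allow snapping to a time unit with `@`, 
so a query over the previous seven full days might look like the sketch below. 
This assumes the plugin passes both values through to Splunk unchanged, as the 
documentation link above suggests; it has not been verified against this plugin.
+
+```sql
+-- Sketch only: '-7d@d' is seven days ago snapped to midnight,
+-- '@d' is the start of the current day.
+SELECT <fields>
+FROM splunk.<index>
+WHERE earliestTime='-7d@d' AND latestTime='@d'
+```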
+  
+ ## Data Types
+  Splunk does not have sophisticated data types and unfortunately does not 
provide metadata from its query results.  With the exception of the fields 
below, Drill will interpret
+   all fields as `VARCHAR` and hence you will have to convert them to the 
appropriate data type at query time.
+  
+  #### Timestamp Fields
+  * `_indextime`
+  * `_time` 
+  
+  #### Numeric Fields
+  * `date_hour` 
+  * `date_mday`
+  * `date_minute`
+  * `date_second` 
+  * `date_year`
+  * `linecount`
+  
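+  Any field outside the two lists above arrives as `VARCHAR`, so a conversion 
at query time could look like the following minimal sketch. The field names 
`bytes` and `status` are hypothetical and assume the index holds numeric text 
in those fields.
+
+```sql
+-- `bytes` and `status` are hypothetical field names used for illustration.
+SELECT CAST(`bytes` AS INT) AS bytes_int,
+       CAST(`status` AS INT) AS status_code
+FROM splunk.<index>
+WHERE earliestTime='-15m' AND latestTime='now'
+```
+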
+ ### Nested Data
+ Splunk has two different types of nested data, which roughly map to Drill's 
`LIST` and `MAP` data types. Unfortunately, there is no easy way to identify 
whether a field is a
+  nested field at query time because Splunk does not provide any metadata; 
all fields are therefore treated as `VARCHAR`.

Review comment:
       @vdiravka 
   Thanks for the question.  I'd like to integrate this plugin with the 
metastore and give it the ability to set the schema.  Could you give me some 
sample code or docs as to how to do that?



