[
https://issues.apache.org/jira/browse/DRILL-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abhishek Girish updated DRILL-7751:
-----------------------------------
Fix Version/s: (was: 1.18.0)
1.19.0
> Add Storage Plugin for Splunk
> -----------------------------
>
> Key: DRILL-7751
> URL: https://issues.apache.org/jira/browse/DRILL-7751
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Other
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Fix For: 1.19.0
>
>
> # Drill Connector for Splunk
> This plugin enables Drill to query Splunk.
> ## Configuration
> To connect Drill to Splunk, create a new storage plugin with the following
> configuration:
> To successfully connect, Splunk uses port `8089` for interfaces. This port
> must be open for Drill to query Splunk.
> ```json
> {
> "type":"splunk",
> "username": "admin",
> "password": "changeme",
> "hostname": "localhost",
> "port": 8089,
> "earliestTime": "-14d",
> "latestTime": "now",
> "enabled": false
> }
> ```
> ## Understanding Splunk's Data Model
> Splunk's primary use case is analyzing event logs with a timestamp. As such,
> data is indexed by the timestamp, with the most recent data being indexed
> first. By default, Splunk
> will sort the data in reverse chronological order. Large Splunk
> installations will put older data into buckets of hot, warm and cold storage
> with the "cold" storage on the
> slowest and cheapest disks.
>
> With this understood, it is **very** important to put time boundaries on your
> Splunk queries. The Drill plugin allows you to set default values in the
> configuration such that every
> query you run will be bounded by these boundaries. Alternatively, you can
> set the time boundaries at query time. In either case, you will achieve the
> best performance when
> you are asking Splunk for the smallest amount of data possible.
>
> ## Understanding Drill's Data Model with Splunk
> Drill treats Splunk indexes as tables. Splunk's access model does not
> restrict to the catalog, but does restrict access to the actual data. It is
> therefore possible that you can
> see the names of indexes to which you do not have access. You can view the
> list of available indexes with a `SHOW TABLES IN splunk` query.
>
> ```
> apache drill> SHOW TABLES IN splunk;
> +--------------+----------------+
> | TABLE_SCHEMA | TABLE_NAME |
> +--------------+----------------+
> | splunk | summary |
> | splunk | splunklogger |
> | splunk | _thefishbucket |
> | splunk | _audit |
> | splunk | _internal |
> | splunk | _introspection |
> | splunk | main |
> | splunk | history |
> | splunk | _telemetry |
> +--------------+----------------+
> 9 rows selected (0.304 seconds)
> ```
> To query Splunk from Drill, use the following format:
> ```sql
> SELECT <fields>
> FROM splunk.<index>
> ```
>
> ## Bounding Your Queries
> When you learn to query Splunk via their interface, the first thing you
> learn is to bound your queries so that they are looking at the shortest time
> span possible. When using
> Drill to query Splunk, it is advisable to do the same thing, and Drill
> offers two ways to accomplish this: via the configuration and at query time.
>
> ### Bounding your Queries at Query Time
> The easiest way to bound your query is to do so at querytime via special
> filters in the `WHERE` clause. There are two special fields, `earliestTime`
> and `latestTime` which can
> be set to bound the query. If they are not set, the query will be bounded
> to the defaults set in the configuration.
>
> You can use any of the time formats specified in the Splunk documentation
> here:
>
> https://docs.splunk.com/Documentation/Splunk/8.0.3/SearchReference/SearchTimeModifiers
>
> So if you wanted to see your data for the last 15 minutes, you could
> execute the following query:
> ```sql
> SELECT <fields>
> FROM splunk.<index>
> WHERE earliestTime='-15m' AND latestTime='now'
> ```
> The variables set in a query override the defaults from the configuration.
>
> ## Data Types
> Splunk does not have sophisticated data types and unfortunately does not
> provide metadata from its query results. With the exception of the fields
> below, Drill will interpret
> all fields as `VARCHAR` and hence you will have to convert them to the
> appropriate data type at query time.
>
> #### Timestamp Fields
> * `_indextime`
> * `_time`
>
> #### Numeric Fields
> * `date_hour`
> * `date_mday`
> * `date_minute`
> * `date_second`
> * `date_year`
> * `linecount`
>
> ### Nested Data
> Splunk has two different types of nested data which roughly map to Drill's
> `LIST` and `MAP` data types. Unfortunately, there is no easy way to identify
> whether a field is a
> nested field at querytime as Splunk does not provide any metadata and
> therefore all fields are treated as `VARCHAR`.
>
> However, Drill does have built in functions to easily convert Splunk
> multifields into Drill `LIST` and `MAP` data types. For a LIST, simply use
> the
> `SPLIT(<field>, ' ')` function to split the field into a `LIST`.
>
> `MAP` data types are rendered as JSON in Splunk. Fortunately JSON can
> easily be parsed into a Drill Map by using the `convert_fromJSON()` function.
> The query below
> demonstrates how to convert a JSON column into a Drill `MAP`.
>
> ```sql
> SELECT convert_fromJSON(_raw)
> FROM splunk.spl
> WHERE spl = '| makeresults
> | eval _raw="{\"pc\":{\"label\":\"PC\",\"count\":24,\"peak24\":12},\"ps3\":
> {\"label\":\"PS3\",\"count\":51,\"peak24\":10},\"xbox\":
> {\"label\":\"XBOX360\",\"count\":40,\"peak24\":11},\"xone\":
> {\"label\":\"XBOXONE\",\"count\":105,\"peak24\":99},\"ps4\":
> {\"label\":\"PS4\",\"count\":200,\"peak24\":80}}"'
> ```
> ### Selecting Fields
> When you execute a query in Drill for Splunk, the fields you select are
> pushed down to Splunk. Therefore, it will always be more efficient to
> explicitly specify fields to push
> down to Splunk rather than using `SELECT *` queries.
>
> ### Special Fields
> There are several fields which can be included in a Drill query
>
> * `spl`: If you just want to send an SPL query to Splunk, this will do
> that.
> * `earliestTime`: Overrides the `earliestTime` setting in the configuration.
> * `latestTime`: Overrides the `latestTime` setting in the configuration.
>
> ### Sorting Results
> Due to the nature of Splunk indexes, data will always be returned in reverse
> chronological order. Thus, sorting is not necessary if that is the desired
> order.
> ## Sending Arbitrary SPL to Splunk
> There is a special table called `spl` which you can use to send arbitrary
> queries to Splunk. If you use this table, you must include a query in the
> `spl` filter as shown below:
> ```sql
> SELECT *
> FROM splunk.spl
> WHERE spl='<your SPL query'
> ```
> # Testing the Plugin
> This plugin includes a series of unit tests in the `src/test/` directory,
> however you will need an active Splunk installation to run them. Since
> Splunk is not an open source
> project, nor is available as a Docker container, simply follow the
> instructions below to test Splunk with Drill.
>
> ### Step 1: Get Splunk
> From Splunk's website, simply download and install the free version here:
> https://www.splunk.com/en_us/download/splunk-enterprise.html
>
> Once you've downloaded Splunk, start it up, and make sure everything is
> working properly.
>
> ### Step 2: Add Data
> Next, go here:
> https://docs.splunk.com/Documentation/Splunk/7.0.3/SearchTutorial/Systemrequirements
> and download the dummy datasets that Splunk provides. Once you've downloaded
> this data, have Splunk index this data and you're ready to go from the
> Splunk end.
>
> ## Known Limitations
> * At present, Drill will not interpret Splunk multifields as anything other
> than a String. If there is interest, this feature can be implemented.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)