[
https://issues.apache.org/jira/browse/DRILL-7716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Givre updated DRILL-7716:
---------------------------------
Labels: enhancement (was: )
> Create Format Plugin for SPSS Files
> -----------------------------------
>
> Key: DRILL-7716
> URL: https://issues.apache.org/jira/browse/DRILL-7716
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - Text & CSV
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Major
> Labels: enhancement
> Fix For: 1.18.0
>
>
> # Format Plugin for SPSS (SAV) Files
> This format plugin enables Apache Drill to read and query Statistical Package
> for the Social Sciences (SPSS) (or Statistical Product and Service Solutions)
> data files. According
> to Wikipedia: [1]
>
> SPSS is a widely used program for statistical analysis in social science. It
> is also used by market researchers, health researchers, survey companies,
> government, education researchers, marketing organizations, data miners, and
> others. The original SPSS manual (Nie, Bent & Hull, 1970) has been described
> as one of "sociology's most influential books" for allowing ordinary
> researchers to do their own statistical analysis. In addition to statistical
> analysis, data management (case selection, file reshaping, creating derived
> data) and data documentation (a metadata dictionary is stored in the
> datafile) are features of the base software.
>
>
> ## Configuration
> To configure Drill to read SPSS files, simply add the following code to the
> formats section of your file-based storage plugin. This should happen
> automatically for the default
> `cp`, `dfs`, and `S3` storage plugins.
>
> Other than the file extensions, there are no variables to configure.
>
> ```json
> "spss": {
> "type": "spss",
> "extensions": [
> "sav"
> ]
> }
> ```
> ## Data Model
> SPSS only supports two data types: Numeric and Strings. Drill maps these to
> `DOUBLE` and `VARCHAR` respectively. However, for some numeric columns, SPSS
> maps these numbers to
> text, similar to an `enum` field in Java.
>
> For instance, a field called `Survey` might have labels as shown below:
>
> <table>
> <tr>
> <th>Value</th>
> <th>Text</th>
> </tr>
> <tr>
> <td>1</td>
> <td>Yes</td>
> </tr>
> <tr>
> <td>2</td>
> <td>No</td>
> </tr>
> <tr>
> <td>99</td>
> <td>No Answer</td>
> </tr>
> </table>
> For situations like this, Drill will create two columns. In the example above
> you would get a column called `Survey` which has the numeric value (1,2 or
> 99) as well as a column
> called `Survey_value` which will map the integer to the appropriate value.
> Thus, the results would look something like this:
>
> <table>
> <tr>
> <th>`Survey`</th>
> <th>`Survey_value`</th>
> </tr>
> <tr>
> <td>1</td>
> <td>Yes</td>
> </tr>
> <tr>
> <td>1</td>
> <td>Yes</td>
> </tr>
> <tr>
> <td>1</td>
> <td>Yes</td>
> </tr>
> <tr>
> <td>2</td>
> <td>No</td>
> </tr>
> <tr>
> <td>1</td>
> <td>Yes</td>
> </tr>
> <tr>
> <td>2</td>
> <td>No</td>
> </tr>
> <tr>
> <td>99</td>
> <td>No Answer</td>
> </tr>
> </table>
> [1]: https://en.wikipedia.org/wiki/SPSS
--
This message was sent by Atlassian Jira
(v8.3.4#803005)