ztomanek-dw opened a new pull request, #2840:
URL: https://github.com/apache/drill/pull/2840

   # [DRILL-8457](https://issues.apache.org/jira/browse/DRILL-8457): Allow configuring the CSV parser in the HTTP storage plugin configuration
   
   ## Description
   
   `HttpApiConfiguration` was extended with a `csvOptions` field, which allows setting the following properties:
   
   ```json
   {
     "csvOptions": {
       "delimiter": ",",
       "quote": "\"",
       "quoteEscape": "\"",
       "lineSeparator": "\n",
       "headerExtractionEnabled": null,
       "numberOfRowsToSkip": 0,
       "numberOfRecordsToRead": -1,
       "lineSeparatorDetectionEnabled": true,
       "maxColumns": 512,
       "maxCharsPerColumn": 4096,
       "skipEmptyLines": true,
       "ignoreLeadingWhitespaces": true,
       "ignoreTrailingWhitespaces": true,
       "nullValue": null
     }
   }
   ```
   
   This provides greater CSV parsing flexibility, since users can set a different delimiter, maximum number of columns, or maximum column size.
   
   Backward compatibility is preserved: the parser works the same as before if `csvOptions` is null.
   
   ## Documentation
   
   Add the following paragraph to https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections:
   
   ````
   ##### CSV parser options
   
   The CSV parser of the HTTP storage plugin can be configured using `csvOptions`.
   
   ```json
   {
     "csvOptions": {
       "delimiter": ",",
       "quote": "\"",
       "quoteEscape": "\"",
       "lineSeparator": "\n",
       "headerExtractionEnabled": null,
       "numberOfRowsToSkip": 0,
       "numberOfRecordsToRead": -1,
       "lineSeparatorDetectionEnabled": true,
       "maxColumns": 512,
       "maxCharsPerColumn": 4096,
       "skipEmptyLines": true,
       "ignoreLeadingWhitespaces": true,
       "ignoreTrailingWhitespaces": true,
       "nullValue": null
     }
   }
   ```
   
   For example, to parse `.tsv` files, use the following config:
   
   ```json
   {
     "csvOptions": {
       "delimiter": "\t"
     }
   }
   ```
   
   ````
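
To illustrate why the delimiter matters for `.tsv` input, here is a minimal sketch using Python's standard `csv` module. It is not Drill's parser and does not use the plugin; the helper name `parse_delimited` and the sample rows are made up for illustration only.

```python
import csv
import io


def parse_delimited(text, delimiter=",", quote='"'):
    """Parse delimited text into a list of rows (hypothetical helper, not part of Drill)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quote)
    return [row for row in reader]


# Made-up tab-separated sample; the third field contains a comma.
sample = "id\tword\tsentence\n1\tBank\tDie Bank, an der ich sitze\n"

# With a tab delimiter each line splits into three fields.
rows_tsv = parse_delimited(sample, delimiter="\t")

# With the comma delimiter the header stays one field and the data row
# is split at the comma inside the sentence instead of at the tabs.
rows_csv = parse_delimited(sample)
```

The same text yields three clean columns only when the delimiter matches the file, which is the situation the `"delimiter": "\t"` setting is meant to address.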
   
   ## Testing
   
   Create the following storage plugin with the name `github`:
   
   
   ```json
   {
     "type": "http",
     "connections": {
       "test-data": {
         "url": "https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt",
         "requireTail": false,
         "method": "GET",
         "authType": "none",
         "inputType": "csv",
         "xmlDataLevel": 1,
         "postParameterLocation": "QUERY_STRING",
         "csvOptions": {
           "delimiter": "\t",
           "quote": "\"",
           "quoteEscape": "\"",
           "lineSeparator": "\n",
           "numberOfRecordsToRead": -1,
           "lineSeparatorDetectionEnabled": true,
           "maxColumns": 512,
           "maxCharsPerColumn": 4096,
           "skipEmptyLines": true,
           "ignoreLeadingWhitespaces": true,
           "ignoreTrailingWhitespaces": true
         },
         "verifySSLCert": true
       }
     },
     "timeout": 5,
     "retryDelay": 1000,
     "proxyType": "direct",
     "authMode": "SHARED_USER",
     "enabled": true
   }
   ```
   
   Then query the TSV file with:
   
   ```sql
   SELECT * from github.`test-data`
   ```
   
   You should see a result set containing three columns.
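
The test steps can also be scripted against Drill's REST API. The sketch below assumes a Drillbit reachable at `http://localhost:8047` without authentication; `DRILL_URL` and the helper names are hypothetical and not part of this PR. It only builds the requests for the `/storage/{name}.json` and `/query.json` endpoints, and sends nothing until `send` is called.

```python
import json
import urllib.request

# Assumption: a local Drillbit on the default web port, with authentication disabled.
DRILL_URL = "http://localhost:8047"


def storage_plugin_request(name, config, base_url=DRILL_URL):
    """Build a POST request that creates or updates a storage plugin."""
    body = json.dumps({"name": name, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/storage/{name}.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sql_query_request(sql, base_url=DRILL_URL):
    """Build a POST request that submits a SQL query."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the plugin JSON above loaded into a dict `config`, calling `send(storage_plugin_request("github", config))` and then ``send(sql_query_request("SELECT * FROM github.`test-data`"))`` would perform the same two steps as the manual test.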
   

