ztomanek-dw opened a new pull request, #2840: URL: https://github.com/apache/drill/pull/2840
# [DRILL-8457](https://issues.apache.org/jira/browse/DRILL-8457): Allow configuring csv parser in http storage plugin configuration

## Description

`HttpApiConfiguration` was extended with a `csvOptions` field, which allows setting the following properties:

```json
{
  "csvOptions": {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": null,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": true,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": true,
    "ignoreLeadingWhitespaces": true,
    "ignoreTrailingWhitespaces": true,
    "nullValue": null
  }
}
```

This provides greater CSV parsing flexibility, since the user can set a different delimiter, maximum number of columns, or maximum column size. Backward compatibility is preserved: the parser works the same as before if `csvOptions` is null.

## Documentation

Add the following paragraph to https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections

````
##### CSV parser options

The CSV parser of the HTTP storage plugin can be configured using `csvOptions`.

```json
{
  "csvOptions": {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": null,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": true,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": true,
    "ignoreLeadingWhitespaces": true,
    "ignoreTrailingWhitespaces": true,
    "nullValue": null
  }
}
```

E.g. to parse `.tsv` files you can use the following config:

```json
{
  "csvOptions": {
    "delimiter": "\t"
  }
}
```
````

## Testing

Create the following storage plugin with the name `github`:

```json
{
  "type": "http",
  "connections": {
    "test-data": {
      "url": "https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt",
      "requireTail": false,
      "method": "GET",
      "authType": "none",
      "inputType": "csv",
      "xmlDataLevel": 1,
      "postParameterLocation": "QUERY_STRING",
      "csvOptions": {
        "delimiter": "\t",
        "quote": "\"",
        "quoteEscape": "\"",
        "lineSeparator": "\n",
        "numberOfRecordsToRead": -1,
        "lineSeparatorDetectionEnabled": true,
        "maxColumns": 512,
        "maxCharsPerColumn": 4096,
        "skipEmptyLines": true,
        "ignoreLeadingWhitespaces": true,
        "ignoreTrailingWhitespaces": true
      },
      "verifySSLCert": true
    }
  },
  "timeout": 5,
  "retryDelay": 1000,
  "proxyType": "direct",
  "authMode": "SHARED_USER",
  "enabled": true
}
```

Then query the TSV file with:

```sql
SELECT * FROM github.`test-data`
```

You should see a result set containing three columns.
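
To build intuition for what `"delimiter": "\t"` and `"skipEmptyLines": true` do to the column count, here is a minimal sketch using Python's standard `csv` module. It is not Drill's parser and only approximates three of the options above; the helper `parse_with_options` and the sample rows are hypothetical and are not taken from the test file.

```python
import csv
import io


def parse_with_options(text, options):
    """Hypothetical helper: applies a subset of csvOptions-style keys
    (delimiter, quote, skipEmptyLines) using Python's csv.reader."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=options.get("delimiter", ","),
        quotechar=options.get("quote", '"'),
    )
    skip_empty = options.get("skipEmptyLines", True)
    rows = []
    for row in reader:
        # csv.reader yields an empty list for a blank line
        if skip_empty and not row:
            continue
        rows.append(row)
    return rows


# Invented tab-separated sample with one blank line in the middle
sample = "id\tword\tsentence\n1\tbank\tShe sat on the bank.\n\n2\tlock\tThe lock is old.\n"

rows = parse_with_options(sample, {"delimiter": "\t"})
column_counts = {len(row) for row in rows}
```

With the default `","` delimiter the same sample would not split on tabs, which is why the test connection above sets `"delimiter": "\t"` explicitly.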