[
https://issues.apache.org/jira/browse/DRILL-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779443#comment-17779443
]
ASF GitHub Bot commented on DRILL-8457:
---------------------------------------
ztomanek-dw opened a new pull request, #2840:
URL: https://github.com/apache/drill/pull/2840
# [DRILL-8457](https://issues.apache.org/jira/browse/DRILL-8457): Allow configuring csv parser in http storage plugin configuration
## Description
`HttpApiConfiguration` was extended with a `csvOptions` field, which allows
setting the following properties:
```json
{
  "csvOptions": {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": null,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": true,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": true,
    "ignoreLeadingWhitespaces": true,
    "ignoreTrailingWhitespaces": true,
    "nullValue": null
  }
}
```
This provides greater CSV parsing flexibility, since the user can set a different
delimiter, maximum number of columns, or maximum column size.
Backward compatibility is preserved: the parser works the same as before if
`csvOptions` is null.
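As an illustration of that flexibility, a connection serving semicolon-delimited data with very long text columns might use a fragment like the following. The values are hypothetical, and the sketch assumes that properties omitted from `csvOptions` keep their default values, as the delimiter-only TSV example in the documentation section suggests.

```json
{
  "csvOptions": {
    "delimiter": ";",
    "maxColumns": 1024,
    "maxCharsPerColumn": 65536
  }
}
```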
## Documentation
Add the following paragraph to
https://drill.apache.org/docs/http-storage-plugin/#configuring-the-api-connections
````
##### CSV parser options
The CSV parser of the HTTP storage plugin can be configured using `csvOptions`.
```json
{
  "csvOptions": {
    "delimiter": ",",
    "quote": "\"",
    "quoteEscape": "\"",
    "lineSeparator": "\n",
    "headerExtractionEnabled": null,
    "numberOfRowsToSkip": 0,
    "numberOfRecordsToRead": -1,
    "lineSeparatorDetectionEnabled": true,
    "maxColumns": 512,
    "maxCharsPerColumn": 4096,
    "skipEmptyLines": true,
    "ignoreLeadingWhitespaces": true,
    "ignoreTrailingWhitespaces": true,
    "nullValue": null
  }
}
```
For example, to parse `.tsv` files you can use the following config:
```json
{
  "csvOptions": {
    "delimiter": "\t"
  }
}
```
````
## Testing
Create the following storage plugin with the name `github`:
```json
{
  "type": "http",
  "connections": {
    "test-data": {
      "url": "https://raw.githubusercontent.com/semantic-web-company/wic-tsv/master/data/de/Test/test_examples.txt",
      "requireTail": false,
      "method": "GET",
      "authType": "none",
      "inputType": "csv",
      "xmlDataLevel": 1,
      "postParameterLocation": "QUERY_STRING",
      "csvOptions": {
        "delimiter": "\t",
        "quote": "\"",
        "quoteEscape": "\"",
        "lineSeparator": "\n",
        "numberOfRecordsToRead": -1,
        "lineSeparatorDetectionEnabled": true,
        "maxColumns": 512,
        "maxCharsPerColumn": 4096,
        "skipEmptyLines": true,
        "ignoreLeadingWhitespaces": true,
        "ignoreTrailingWhitespaces": true
      },
      "verifySSLCert": true
    }
  },
  "timeout": 5,
  "retryDelay": 1000,
  "proxyType": "direct",
  "authMode": "SHARED_USER",
  "enabled": true
}
```
Then query the TSV file with:
```sql
SELECT * FROM github.`test-data`;
```
You should see a result set containing three columns.
> Allow configuring csv parser in http storage plugin configuration
> -----------------------------------------------------------------
>
> Key: DRILL-8457
> URL: https://issues.apache.org/jira/browse/DRILL-8457
> Project: Apache Drill
> Issue Type: Improvement
> Components: Storage - HTTP
> Affects Versions: Future
> Reporter: Zbigniew Tomanek
> Priority: Minor
> Fix For: Future
>
>
> Currently there is no way to configure the CSV parser when the HTTP plugin is used.
> Because of that, some files cannot be parsed (e.g. when any column has
> more than 4096 characters or the file uses a delimiter other than `,`).
> Since we use the HTTP plugin quite often at DataWalk, we've changed our
> internal fork of Drill so that the following parser/format properties can be
> configured using an additional `csvOptions` field:
>
> {code:json}
> {
> "csvOptions": {
> "delimiter": "\t",
> "quote": "\"",
> "quote_escape": "\"",
> "line_separator": "\n",
> "header_extraction_enabled": null,
> "number_of_rows_to_skip": 0,
> "number_of_records_to_read": -1,
> "line_separator_detection_enabled": true,
> "max_columns": 512,
> "max_chars_per_column": 4096,
> "skip_empty_lines": true,
> "ignore_leading_whitespaces": true,
> "ignore_trailing_whitespaces": true,
> "null_value": null
> }
> }{code}
> I'd be glad to get feedback on whether creating a PR with these changes would
> bring any value to Drill.