Hello Drill Community,

I would like to put forward some thoughts I've had about the CSV reader in Drill. I would like to propose a few changes, some of which could be breaking changes, so I wanted to see whether there are any strongly held opinions in the community. Here goes:

The Problems:

1. By default, Drill leaves the `extractHeader` option set to false. When a user queries a CSV file this way, the results are returned in a single list of columns called `columns`. Thus, if a user wants the first column, they have to project `columns[0]`. I have never been a fan of this behavior. Even though Drill ships with the csvh file extension, which enables header extraction, this is not a commonly used file extension. Furthermore, the returned results (the `columns` list) do not work well with BI tools.
2. The CSV reader does not attempt any kind of data type discovery.

Proposed Changes:

The overall goal is to make it easier to query CSV data and to make the behavior more consistent across format plugins.

1. Change the default behavior and set `extractHeader` to true.
2. Replace the `columns` array with a model similar to the Excel reader. Other formats, like the Excel reader, read tables directly into columns. If the header is not known, Drill assigns a name of field_n.
3. Implement schema discovery (data types) with an allTextMode option similar to the JSON reader. When allTextMode is disabled, the CSV reader would attempt to infer data types.

Since there are some breaking changes here, I'd like to ask whether people have any strong feelings or suggestions on this topic.

Thanks!
-- C
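
P.S. To make proposed changes 2 and 3 a bit more concrete, here is a rough sketch of the behavior I have in mind. It is written in Python purely for illustration and is not Drill code. The names `read_csv`, `infer_type`, `extract_header`, and `all_text_mode` are hypothetical, the starting index for field_n is an assumption, and the inference rule (try integer, then float, else text) is only one possible choice.

```python
import csv
import io


def infer_type(values):
    """Return int, float, or str: the first type that parses every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return str  # nothing to infer from, so fall back to text
    for candidate in (int, float):
        try:
            for v in non_empty:
                candidate(v)
            return candidate
        except ValueError:
            continue
    return str


def read_csv(text, extract_header=True, all_text_mode=False):
    """Read CSV text into named columns instead of a single `columns` array.

    Assumes every row has the same number of fields.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    if extract_header:
        names, data = rows[0], rows[1:]
    else:
        # No header available: generate field_n names (start index is an assumption).
        names = [f"field_{i}" for i in range(len(rows[0]))]
        data = rows
    if all_text_mode:
        types = [str] * len(names)  # skip inference, keep everything as text
    else:
        types = [infer_type([row[i] for row in data]) for i in range(len(names))]
    # Empty fields become None in both modes.
    records = [
        {n: (t(v) if v != "" else None) for n, t, v in zip(names, types, row)}
        for row in data
    ]
    return names, records
```

With this sketch, `read_csv("id,price\n1,2.5\n")` would yield an integer `id` and a float `price`, while passing `all_text_mode=True` would leave both as strings, which is roughly how I picture the option behaving alongside the JSON reader's.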
