gianm commented on PR #13360: URL: https://github.com/apache/druid/pull/13360#issuecomment-1343606869
A great improvement in usability! Specifying JSON-inside-SQL does get old. Thanks for improving this. After noticing this land in master, I realized I have a few comments arising from reading the new docs, so I'll write them here. - The new extension-column syntax is cool, and more SQLy. Would it be possible to use this with `EXTERN` too? Like, make the third argument (row signature) optional, and allow people to provide the column list in SQL? - Comma-separated files and URIs are a drag, I think. Local files and URIs can both legitimately have commas in them. Given that SQL supports arrays, could these be SQL arrays instead? - The bits about "this selects CsvInputFormat" and "this selects JsonInputFormat" should be reworded, because we shouldn't use Java class names in user-facing documentation. Instead we should use the type codes these classes have in the API, which would be "`csv` input format" and "`json` input format" respectively. These could link to the docs for those formats, i.e. `../ingestion/data-formats.html#csv` and `../ingestion-data-formats.html#json`. - Some table-function-format parameter names are different from the corresponding input-format parameters in ways that don't seem necessary. Like, `skipRows` vs `skipHeaderRows`. I think it'd be best if these were the same, so people can have an easier time translating between table-function-formats and input-formats. - For JSON, `keepNullColumns` isn't documented for the `json` input format, but `keepNulls` is documented for the table function. I'm not sure why it isn't documented for the input format. But I do note that the description here in the table function docs is confusing: what does it mean to "keep" or "not keep" null values? Given the confusingness of the behavior, and the fact that it isn't documented for input formats, I suggest we should remove it from the table-function-format too. - The "due to #13359 this functionality is broken" piece bugged me. It's not broken for the intended use case of a local input source, where you have a single-server setup, or a shared volume across all servers in a multi-server setup. In those cases, nothing bad happens by resolving relative paths upon first serialization. I mean, a user won't notice a difference, because all servers would resolve the path the same way anyway. So I was going to suggest removing this wording, or at least adjusting it such that it it's clear that the functionality works fine in the scenario it was designed for. However, I decided to simply fix it instead: https://github.com/apache/druid/pull/13534. That PR also removes the doc wording, since it's no longer needed. Btw, I also added the "Release Notes" label, since there is a "Release Notes" section in the PR description. The label helps the release manager find all the PRs that are associated with interesting release notes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
