alamb opened a new pull request #7906: URL: https://github.com/apache/arrow/pull/7906
This PR builds on https://github.com/apache/arrow/pull/7905 to do two things: 1. Make better error messages for CREATE EXTERNAL TABLE commands that are not semantically valid and prevents a subsequent panic in the planner 2. Move planning logic for CREATE EXTERNAL TABLE out of context.rs and into planer.rs The behavior before this PR: ``` CREATE EXTERNAL TABLE repro STORED AS CSV LOCATION 'repro.csv'; > select * from repro; thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', datafusion/src/optimizer/projection_push_down.rs:238:31 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` The behavior after this PR: ``` > CREATE EXTERNAL TABLE repro STORED AS CSV LOCATION 'repro.csv'; General("Column definitions required for CSV files. None found") ``` There are more detail on https://issues.apache.org/jira/browse/ARROW-9652 h2. Planning Logic Consolidation into planner.rs This change could be much smaller if I had placed the checks directly in parser.rs. However, I instead propose the larger (but I think more coherent) strategy of moving the planing logic for CREATE EXTERNAL into the planner. This means that now the planner handles all the DataFusion SQL Dialects internally rather than only the SQL ones. One reason I didn't just want to put the checks into parser.rs itself was that this check seems like it is actually a semantic check rather than a syntactic one -- aka there is nothing wrong with the SQL itself, but there is something logically wrong with the statement. The PR is broken up into two commits -- one to move the CREATE EXTERNAL TABLE code into the planner and the other to add new semantic checks. I am happy to break the change into two PRs if you prefer. I am also happy to move the check to parser.rs, if you prefer. h2. CSV Schema Inference Logic There appears to be some CSV schema inference logic in DataFusion (e.g. [CsvEec::try_infer et al](https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/physical_plan/csv.rs#L123) , but it does not appear to be run prior to the planner panicing. I may have missed a better intended behavior that the CSV schema inference logic gets run at CREATE EXTERNAL TABLE planning time. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
