alamb opened a new pull request #7906:
URL: https://github.com/apache/arrow/pull/7906


   This PR builds on https://github.com/apache/arrow/pull/7905 to do two things:
   1. Make better error messages for CREATE EXTERNAL TABLE commands that are 
not semantically valid and prevents a subsequent panic in the planner
   2. Move planning logic for CREATE EXTERNAL TABLE out of context.rs and into 
planer.rs
   
   The behavior before this PR:
   
   ```
   CREATE EXTERNAL TABLE repro
   STORED AS CSV
   LOCATION 'repro.csv';
   
   > select * from repro;
   thread 'main' panicked at 'index out of bounds: the len is 0 but the index 
is 0', datafusion/src/optimizer/projection_push_down.rs:238:31
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   
   The behavior after this PR:
   
   ```
   > CREATE EXTERNAL TABLE repro
   STORED AS CSV
   LOCATION 'repro.csv';
   
   General("Column definitions required for CSV files. None found")
   ```
   
   There are more detail on https://issues.apache.org/jira/browse/ARROW-9652
   
   h2. Planning Logic Consolidation into planner.rs
   
   This change could be much smaller if I had placed the checks directly in 
parser.rs. However, I instead propose the larger (but I think more coherent) 
strategy of moving the planing logic for CREATE EXTERNAL into the planner. This 
means that now the planner handles all the DataFusion SQL Dialects internally 
rather than only the SQL ones.
   
   One reason I didn't just want to put the checks into parser.rs itself was 
that this check seems like it is actually a semantic check rather than a 
syntactic one -- aka there is nothing wrong with the SQL itself, but there is 
something logically wrong with the statement.
   
   The PR is broken up into two commits -- one to move the CREATE EXTERNAL 
TABLE code into the planner and the other to add new semantic checks. I am 
happy to break the change into two PRs if you prefer.
   
   I am also happy to move the check to parser.rs, if you prefer.
   
   
   
   h2. CSV Schema Inference Logic
   There appears to be some CSV schema inference logic in DataFusion (e.g. 
[CsvEec::try_infer et 
al](https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/physical_plan/csv.rs#L123)
 , but it does not appear to be run prior to the planner panicing.
   
   I may have missed a better intended behavior that the CSV schema inference 
logic gets run at CREATE EXTERNAL TABLE planning time.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to