[I] Add Support for Multivalued Copy Options in DFParser [arrow-datafusion]

via GitHub Mon, 19 Feb 2024 09:05:49 -0800


devinjdangelo opened a new issue, #9274:
URL: https://github.com/apache/arrow-datafusion/issues/9274


   ### Is your feature request related to a problem or challenge?
   
   The partition_by COPY option is multivalued, e.g.:
   
   ```sql
   COPY table to file.parquet (partition_by 'a,b,c')
   ```
   
   Currently this is handled by passing a comma-separated string literal to the 
COPY statement, which is split on commas later, during planning. This parsing 
is not robust to edge cases (e.g. it cannot handle a column name that itself 
contains a comma). 
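
To make the edge case concrete, here is a minimal sketch of the comma-splitting approach. `split_partition_by` is a hypothetical stand-in written for illustration, not the actual DataFusion planning code; it only assumes that the option value is split on `,`.

```rust
/// Hypothetical stand-in for the planning-time split of the
/// `partition_by` string; not DataFusion's actual implementation.
fn split_partition_by(raw: &str) -> Vec<String> {
    raw.split(',').map(str::to_string).collect()
}

fn main() {
    // Three plain column names are recovered as expected.
    assert_eq!(split_partition_by("a,b,c"), vec!["a", "b", "c"]);

    // A single column literally named `a,b` cannot be expressed:
    // the split always yields two names.
    assert_eq!(split_partition_by("a,b"), vec!["a", "b"]);
}
```

With a string literal there is no way to tell one column named `a,b` apart from the two columns `a` and `b`, which is the ambiguity a list syntax would remove.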
   
   Other systems (e.g. DuckDB) have a dedicated syntax for the partition_by 
option (https://duckdb.org/docs/data/partitioning/partitioned_writes.html):
   
   ```sql
   COPY table to file.parquet (partition_by (a,b,c))
   ```
   
   We could support this same syntax with parser updates.
   
   ### Describe the solution you'd like
   
   Add support for multivalued COPY options in DFParser, for example:
   
   ```rust
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub struct CopyToStatement {
       /// Where the data comes from
       pub source: CopyToSource,
       /// The URL to where the data is heading
       pub target: String,
       /// Target specific options
       pub options: Vec<(String, CopyToOptionValue)>,
   }
   
   #[derive(Debug, Clone, PartialEq, Eq)]
   pub enum CopyToOptionValue {
       /// A single [Value], e.g. (format parquet)
       Single(Value),
       /// A list of strings, e.g. (partition_by (a, b, c))
       List(Vec<String>),
   }
   
   pub fn parse_option_value(&mut self) -> Result<CopyToOptionValue, ParserError> {
     let next_token = self.parser.peek_token();
     match next_token.token {
         Token::Word(Word { value, .. }) => {
             self.parser.next_token();
             Ok(CopyToOptionValue::Single(Value::UnQuotedString(value)))
         },
         Token::SingleQuotedString(s) => {
             self.parser.next_token();
             Ok(CopyToOptionValue::Single(Value::SingleQuotedString(s)))
         },
         Token::DoubleQuotedString(s) => {
             self.parser.next_token();
             Ok(CopyToOptionValue::Single(Value::DoubleQuotedString(s)))
         },
         Token::EscapedStringLiteral(s) => {
             self.parser.next_token();
             Ok(CopyToOptionValue::Single(Value::EscapedStringLiteral(s)))
         },
         Token::Number(ref n, l) => {
             self.parser.next_token();
             match n.parse() {
                 Ok(n) => Ok(CopyToOptionValue::Single(Value::Number(n, l))),
                 // The tokenizer should have ensured `n` is an integer
                 // so this should not be possible
                 Err(e) => parser_err!(format!(
                     "Unexpected error: could not parse '{n}' as number: {e}"
                 )),
             }
         },
         Token::LParen => {
             Ok(CopyToOptionValue::List(self.parse_partitions()?))
         },
         _ => self.parser.expected("string or numeric value", next_token),
     }
   }
   ```
   
   The CopyTo logical plan will also need to be updated to accept multivalued 
options. This will take a fair amount of work, since the existing code will 
have to be rewired to handle options that carry more than one value.
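
As a rough sketch of what that downstream handling might look like, the code below reads partition columns from parsed options. `OptionValue` is a simplified stand-in for the proposed `CopyToOptionValue` (its `Single` variant holds a `String` rather than a sqlparser `Value`), and `partition_columns` is a hypothetical helper; neither is existing DataFusion API.

```rust
/// Simplified stand-in for the proposed `CopyToOptionValue`.
/// Assumption: `Single` holds a plain `String` here instead of a `Value`.
#[derive(Debug, Clone, PartialEq, Eq)]
enum OptionValue {
    Single(String),
    List(Vec<String>),
}

/// Hypothetical helper: collect the partition columns from parsed options.
/// Returns an empty list when `partition_by` is absent.
fn partition_columns(options: &[(String, OptionValue)]) -> Vec<String> {
    for (key, value) in options {
        if key.eq_ignore_ascii_case("partition_by") {
            return match value {
                // New form: every list element is one column name, commas included.
                OptionValue::List(cols) => cols.clone(),
                // Legacy form: a comma-separated string literal.
                OptionValue::Single(s) => s.split(',').map(str::to_string).collect(),
            };
        }
    }
    Vec::new()
}

fn main() {
    let list_form = vec![(
        "partition_by".to_string(),
        OptionValue::List(vec!["a,b".to_string(), "c".to_string()]),
    )];
    // The column named `a,b` survives intact in the list form.
    assert_eq!(partition_columns(&list_form), vec!["a,b", "c"]);

    let legacy_form = vec![(
        "partition_by".to_string(),
        OptionValue::Single("a,b,c".to_string()),
    )];
    assert_eq!(partition_columns(&legacy_form), vec!["a", "b", "c"]);
}
```

Accepting both variants in one place would let the existing string form keep working while the list form is introduced, though whether to keep the legacy form at all is a separate design decision.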
   
   ### Describe alternatives you've considered
   
   Keep the parser and logical plan as-is. Partitioning by columns containing 
commas may be a rare enough special case that we can simply not support it. 
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
