metesynnada opened a new issue, #8994:
URL: https://github.com/apache/arrow-datafusion/issues/8994

   ### Is your feature request related to a problem or challenge?
   
   Currently, in our implementation of 'Create External Table', the 
configuration options are not systematically organized, leading to potential 
confusion and complexity for the users. This is especially evident when we 
compare our configuration pattern with other systems like Apache Flink and 
Apache Spark.
   
   For instance, in our current setup, CSV format options and AWS 
credential settings are wired up in the same context. This approach lacks the 
clarity and structure found in similar systems. Examples of more systematic 
configurations can be seen in:
   
   - [Flink FileSystem 
table](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/connectors/table/filesystem/)
   - [Flink Kafka 
table](https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/connectors/table/kafka/)
   - [Spark FileSystem table 
(CSV)](https://spark.apache.org/docs/latest/sql-data-sources-csv.html)
   - [Spark Kafka 
Table](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
   
   Our current 'Create External Table' method is most closely aligned with 
Spark's approach to table creation, even though Spark's flat option style 
could itself be perceived as confusing when applied to our method.
   
   However, one aspect where our system shines is our session context 
configuration. We use a more intuitive dot-separated (.) pattern, such as 
`datafusion.execution.parquet.statistics_enabled`, which is more user-friendly 
and logically structured.
   
   ### Describe the solution you'd like
   
   I propose we adopt a more structured and systematic approach in defining 
table options, similar to our session context configuration. For example, 
instead of the current format:
   
   ```sql
   CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
   OPTIONS ('AWS_ACCESS_KEY_ID' 'asdasd',
            'AWS_SECRET_ACCESS_KEY' 'asdasd',
            'timestamp_format' 'asdasd',
            'date_format' 'asdasd')
   ``` 
   
   
   We could structure it more clearly:
   
   ```sql
   CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
   OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
            'aws.credentials.basic.secretkey' 'asdasd',
            'format.csv.sink.timestamp_format' 'asdasd',
            'format.csv.sink.date_format' 'asdasd')
   ``` 
   
   Or even more detailed:
   
   ```sql
   CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV LOCATION 's3://boo/foo.csv'
   OPTIONS ('aws.credentials.basic.accesskeyid' 'asdasd',
            'aws.credentials.basic.secretkey' 'asdasd',
            'format.csv.scan.datetime_regex' 'asdasd',
            'format.csv.sink.timestamp_format' 'asdasd',
            'format.csv.sink.date_format' 'asdasd')
   ``` 
   
   This approach would separate AWS credentials from CSV format options and 
further delineate options for scanning and sinking, enhancing clarity and ease 
of use.
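
   A consumer of such namespaced keys could recover this grouping by splitting 
each key on its first dot. The Rust sketch below is purely illustrative (the 
`group_options` helper and the key names are hypothetical, not existing 
DataFusion API); it partitions a flat OPTIONS list into per-namespace maps:
   
   ```rust
   use std::collections::HashMap;
   
   /// Group flat OPTIONS key/value pairs by their first dot-separated
   /// segment ("aws", "format", ...). Hypothetical helper for illustration.
   fn group_options(options: &[(&str, &str)]) -> HashMap<String, HashMap<String, String>> {
       let mut groups: HashMap<String, HashMap<String, String>> = HashMap::new();
       for (key, value) in options {
           // Keys without a dot fall into an unnamed ("") namespace.
           let (namespace, rest) = key.split_once('.').unwrap_or(("", key));
           groups
               .entry(namespace.to_string())
               .or_default()
               .insert(rest.to_string(), value.to_string());
       }
       groups
   }
   
   fn main() {
       let opts = [
           ("aws.credentials.basic.accesskeyid", "asdasd"),
           ("format.csv.sink.timestamp_format", "%Y-%m-%dT%H:%M:%S"),
       ];
       let groups = group_options(&opts);
       assert!(groups["aws"].contains_key("credentials.basic.accesskeyid"));
       assert!(groups["format"].contains_key("csv.sink.timestamp_format"));
       println!("{groups:?}");
   }
   ```
   
   Grouping like this would let format options be routed to the CSV 
reader/writer configuration while credentials go to the object store layer, 
without matching on individual key names.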
   
   Impact:
   - User Experience: This change will significantly improve user experience by 
making configuration more intuitive and easy to understand.
   - Documentation: Accompanying documentation will be necessary to guide users 
through the new configuration pattern.
   - Compatibility: This is a breaking change, so a clear migration path 
needs to be provided for existing users.
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

