[ 
https://issues.apache.org/jira/browse/DRILL-6096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955965#comment-16955965
 ] 

ASF GitHub Bot commented on DRILL-6096:
---------------------------------------

arina-ielchiieva commented on issue #1873: DRILL-6096: Provide mechanism to 
configure text writer configuration
URL: https://github.com/apache/drill/pull/1873#issuecomment-544469217
 
 
   @paul-rogers thanks for the code review. I have addressed the review 
comments and force-pushed, since the code changes were minor. 
   
   Regarding design, the aim of this Jira was just to fix the text writer so 
that it writes proper text files. Before, if a column contained the field 
separator, the field was not enclosed in quotes, so we were writing text files 
which Drill could not read. Now, when the user indicates the write format using 
a session option (this is the common approach for all formats), Drill produces 
text files that it can read back. Basically, if the user has configured a 
format plugin:
   ```
     "formats": {
       "csvh": {
         "type": "text",
         "extensions": [
           "csvh"
         ],
         "lineDelimiter": "\n",
         "fieldDelimiter": ",",
         "extractHeader": true
       }
      },
   ```
   Drill will be able to read and write such text files correctly. The same 
approach is used for `parquet` and `json`. All the user needs to do is indicate 
the write format using a session option: 
``alter session set `store.format` = 'csvh';`` (or `parquet`, `json`). I am not 
saying this is ideal, and we might need to reconsider this writing approach, 
but I guess that is not in the scope of this Jira, since such a re-design would 
touch all file writers.
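
   To illustrate the quoting rule described above, here is a minimal Java 
sketch. It is not Drill's or univocity-parsers' actual code: the class and 
method names are hypothetical, and treating an embedded quote character as a 
reason to quote is an assumption. A value is quoted only when it contains the 
field or line separator, unless a `forceQuotes` flag (mirroring the idea of 
`store.text.writer.force_quotes`) is set.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; not Drill's or univocity-parsers' implementation.
final class TextQuotingSketch {

    // Returns the value as it would appear in the output line.
    static String quoteField(String value, char fieldDelimiter, String lineDelimiter,
                             char quote, char escape, boolean forceQuotes) {
        boolean needsQuotes = forceQuotes
                || value.indexOf(fieldDelimiter) >= 0
                || value.contains(lineDelimiter)
                || value.indexOf(quote) >= 0; // assumption: embedded quotes also trigger quoting
        if (!needsQuotes) {
            return value;
        }
        StringBuilder out = new StringBuilder().append(quote);
        for (char c : value.toCharArray()) {
            if (c == quote || c == escape) {
                out.append(escape); // prefix quote/escape characters with the escape character
            }
            out.append(c);
        }
        return out.append(quote).toString();
    }

    // Joins the (possibly quoted) values with the field delimiter and ends the line.
    static String writeLine(List<String> values, char fieldDelimiter, String lineDelimiter,
                            char quote, char escape, boolean forceQuotes) {
        return values.stream()
                .map(v -> quoteField(v, fieldDelimiter, lineDelimiter, quote, escape, forceQuotes))
                .collect(Collectors.joining(String.valueOf(fieldDelimiter))) + lineDelimiter;
    }

    public static void main(String[] args) {
        List<String> row = List.of("1", "Smith, John", "NY");
        // Only the value containing the field delimiter is quoted.
        System.out.print(writeLine(row, ',', "\n", '"', '"', false));
        // With forceQuotes every value is quoted.
        System.out.print(writeLine(row, ',', "\n", '"', '"', true));
    }
}
```

   With the default setting only `Smith, John` is wrapped in quotes, so a 
reader splitting on `,` sees three fields again instead of four.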
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Provide mechanisms to specify field delimiters and quoted text for 
> TextRecordWriter
> -----------------------------------------------------------------------------------
>
>                 Key: DRILL-6096
>                 URL: https://issues.apache.org/jira/browse/DRILL-6096
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Text & CSV
>    Affects Versions: 1.12.0
>            Reporter: Kunal Khatua
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: doc-impacting, ready-to-commit
>             Fix For: 1.17.0
>
>
> Currently, there is no way for a user to specify the field delimiter for 
> writing records as text output. Furthermore, if the fields contain the 
> delimiter, we have no mechanism for specifying quotes.
> By default, quotes should be used to enclose non-numeric fields being written.
> *Description of the implemented changes:*
> 2 options are added to control text writer output:
> {{store.text.writer.add_header}} - indicates if a header should be added to 
> the created text file. Default is true.
> {{store.text.writer.force_quotes}} - indicates if all values should be quoted. 
> Default is false, which means only values that contain special characters 
> (line / field separators) will be quoted.
> Line / field separators and quote / escape characters can be configured in 
> the text format configuration using the Web UI. A user can create a special 
> format only for writing data and then use it when creating files. Such a 
> format can also always be used to read back the written data.
> {noformat}
>   "formats": {
>     "write_text": {
>       "type": "text",
>       "extensions": [
>         "txt"
>       ],
>       "lineDelimiter": "\n",
>       "fieldDelimiter": "!",
>       "quote": "^",
>       "escape": "^"
>     }
>    },
> ...
> {noformat}
> Next, set the specified format and create a text file:
> {noformat}
> alter session set `store.format` = 'write_text';
> create table dfs.tmp.t as select 1 as id from (values(1));
> {noformat}
> Notes:
> 1. univocity-parsers is used to write the data, and it limits the line 
> separator length to at most 2 characters. Drill allows setting a line 
> separator longer than 2 characters, since it can read data split by a line 
> separator of any length, but an exception will be thrown during data write.
> 2. {{extractHeader}} in the text format configuration does not affect whether 
> a header is written to the text file; only {{store.text.writer.add_header}} 
> controls this. {{extractHeader}} is used only when reading the data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
