[jira] [Commented] (DRILL-7279) Support provided schema for CSV without headers

ASF GitHub Bot (JIRA) Mon, 03 Jun 2019 09:36:12 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16854784#comment-16854784
 ]


ASF GitHub Bot commented on DRILL-7279:
---------------------------------------

arina-ielchiieva commented on pull request #1798: DRILL-7279: Enable provided 
schema for text files without headers
URL: https://github.com/apache/drill/pull/1798#discussion_r289932538
 
 

 ##########
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/easy/text/compliant/v3/TextParsingSettingsV3.java
 ##########
 @@ -17,45 +17,123 @@
  */
 package org.apache.drill.exec.store.easy.text.compliant.v3;
 
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.drill.exec.store.dfs.easy.EasySubScan;
 import org.apache.drill.exec.store.easy.text.TextFormatPlugin.TextFormatConfig;
-
 import org.apache.drill.shaded.guava.com.google.common.base.Charsets;
 
 // TODO: Remove the "V3" suffix once the V2 version is retired.
 public class TextParsingSettingsV3 {
 
-  public static final TextParsingSettingsV3 DEFAULT = new 
TextParsingSettingsV3();
-
-  private String emptyValue = null;
-  private boolean parseUnescapedQuotes = true;
-  private byte quote = b('"');
-  private byte quoteEscape = b('"');
-  private byte delimiter = b(',');
-  private byte comment = b('#');
-
-  private long maxCharsPerColumn = Character.MAX_VALUE;
-  private byte normalizedNewLine = b('\n');
-  private byte[] newLineDelimiter = {normalizedNewLine};
-  private boolean ignoreLeadingWhitespaces;
-  private boolean ignoreTrailingWhitespaces;
-  private String lineSeparatorString = "\n";
+  private final String emptyValue = null;
+  private final boolean parseUnescapedQuotes = true;
+  private final byte quote;
+  private final byte quoteEscape;
+  private final byte delimiter;
+  private final byte comment;
+
+  private final long maxCharsPerColumn = Character.MAX_VALUE;
+  private final byte normalizedNewLine = b('\n');
+  private final byte[] newLineDelimiter;
+  private final boolean ignoreLeadingWhitespaces = false;
+  private final boolean ignoreTrailingWhitespaces = false;
+  private final String lineSeparatorString = "\n";
   private boolean skipFirstLine;
 
-  private boolean headerExtractionEnabled;
-  private boolean useRepeatedVarChar = true;
-
-  public void set(TextFormatConfig config){
-    this.quote = bSafe(config.getQuote(), "quote");
-    this.quoteEscape = bSafe(config.getEscape(), "escape");
-    this.newLineDelimiter = config.getLineDelimiter().getBytes(Charsets.UTF_8);
-    this.delimiter = bSafe(config.getFieldDelimiter(), "fieldDelimiter");
-    this.comment = bSafe(config.getComment(), "comment");
-    this.skipFirstLine = config.isSkipFirstLine();
-    this.headerExtractionEnabled = config.isHeaderExtractionEnabled();
-    if (this.headerExtractionEnabled) {
-      // In case of header TextRecordReader will use set of VarChar vectors vs 
RepeatedVarChar
-      this.useRepeatedVarChar = false;
+  private final boolean headerExtractionEnabled;
+  private final boolean useRepeatedVarChar;
+  private final String providedHeaders[];
+
+  /**
+   * Configure the properties for this one scan based on:
+   * <p>
+   * <ul>
+   * <li>The defaults in the plugin config (if properties not defined
+   * in the config JSON.</li>
+   * <li>The config values from the config JSON as stored in the
+   * plugin config.</li>
+   * <li>Table function settings expressed in the query (and passed
+   * in as part of the plugin config.</li>
+   * <li>Table properties.</li>
+   * </ul>
+   * <p>
+   * The implementation does not use system/session properties, but
 
 Review comment:
   Not sure of value if this comment.
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Support provided schema for CSV without headers
> -----------------------------------------------
>
>                 Key: DRILL-7279
>                 URL: https://issues.apache.org/jira/browse/DRILL-7279
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.16.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.17.0
>
>
> Extend the Drill 1.16 provided schema support for the text reader to allow a 
> provided schema for files without headers. Behavior:
> * If the file is configured to not extract headers, and a schema is provided, 
> and the schema has at least one column, then use the provided schema to 
> create individual columns. Otherwise, continue to use {{columns}} as in 
> previous versions.
> * The columns in the schema are assumed to match left-to-right with those in 
> the file.
> * If the schema contains more columns than the file, the extra columns take 
> their default values. (This occurs in schema evolution when a column is added 
> to newer files.)
> * If the file contains more columns than the schema, then the extra columns, 
> at the end of the line, are ignored. This is the same behavior as occurs if 
> the file contains headers.
> h4. Table Properties
> Also adds four table properties for text files. These properties, if present, 
> override those defined in the format plugin configuration. The properties 
> allow the user to have a single "csv" config, but to have many tables with 
> the "csv" suffix, each with different properties. That is, the user need not 
> define a new plugin config, and define a new extension, just to change a file 
> format property. With this system, the user can have a ".csv" file with 
> headers; the user need not define a different suffix (usually ".csvh" in 
> Drill) for this case.
> || Table Property || Equivalent Plugin Config Property ||
> | {{drill.headers}} | {{extractHeader}} |
> | {{drill.skipFirstLine}} |  {{skipFirstLine}} | 
> | {{drill.delimiter}} |  {{fieldDelimiter}} | 
> |  {{drill.commentChar}} |  {{comment}}| 
> For each, the rules are:
> * If the table property is not set, then the plugin property is used.
> * If the table property is set, then the property value replaces the plugin 
> property value for that one specific table.
> * For the delimiter, if the property value is an empty string, then this is 
> the same as an unset property.
> * For the comment, if the property value is an empty string, then the comment 
> is set to the ASCII NULL, which will never match. This effectively turns off 
> the comment feature for this one table.
> * If the delimiter or comment value is longer than a single character, only 
> the first character is used.
> It is possible to use the table properties without specifying a "provided" 
> schema. Just omit any columns from the schema:
> {noformat}
> create schema () for table `dfs.data`.`example`
> PROPERTIES ('drill.headers'='false', 'drill.skipFirstLine'='false', 
> 'drill.delimiter'='|')
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (DRILL-7279) Support provided schema for CSV without headers

Reply via email to