Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/819#discussion_r113833863
  
    --- Diff: common/src/main/java/org/apache/drill/common/types/Types.java ---
    @@ -636,43 +658,63 @@ public static String toString(final MajorType type) {
     
       /**
        * Get the <code>precision</code> of given type.
    -   * @param majorType
    -   * @return
    +   *
    +   * @param majorType major type
    +   * @return precision value
        */
       public static int getPrecision(MajorType majorType) {
    -    MinorType type = majorType.getMinorType();
    -
    -    if (type == MinorType.VARBINARY || type == MinorType.VARCHAR) {
    -      return 65536;
    -    }
    -
         if (majorType.hasPrecision()) {
           return majorType.getPrecision();
         }
     
    -    return 0;
    +    return isScalarStringType(majorType) ? MAX_VARCHAR_LENGTH : UNDEFINED;
    --- End diff --
    
    My point is that calculating precision from metadata works only in 
systems that define metadata, and Drill does not. In the vast majority of 
cases, the length of a string column is unknown: there is no system catalog to 
tell us the expected (or maximum) length. We can guess as you suggest, but that 
helps only in the few cases where someone uses constants or functions. It does 
not help a query such as:
    
    SELECT myStringCol FROM myTable
    
    The only reliable way to estimate the column width is to sample data. 
But, of course, doing that is complex: we can't sample at the output (Screen), 
since we would first have to run, say, an entire sort. So we have to sample at 
the *input*: in the scanner.
    
    In particular, we need each scanner to:
    
    1. Read one batch of data.
    2. Use that data to build the schema (as we already do).
    3. Use that data to estimate maximum string column width.
    4. Send the schema (only) downstream as part of the fast path.
    5. On the second call to next(), return the cached record batch.
    6. On the third call to next(), read a new record batch.
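    The steps above could be sketched roughly as follows (hypothetical class 
and method names, purely illustrative; not actual Drill APIs):

```java
import java.util.List;

// Minimal sketch of the proposed scanner protocol. The scanner reads one
// batch up front, derives the schema and a string-width estimate from it,
// returns a schema-only result first, then the cached data batch, then
// reads fresh batches on later calls.
class SamplingScanner {
  // Outcome of a next() call, loosely modeled on Drill's IterOutcome.
  enum Outcome { SCHEMA_ONLY, DATA, NONE }

  private final List<String[]> source;   // stand-in for the underlying reader
  private int readIndex = 0;
  private String[] cachedBatch;          // first batch, held back one call
  private int estimatedWidth;            // max string width seen in the sample
  private boolean schemaSent = false;

  SamplingScanner(List<String[]> source) {
    this.source = source;
  }

  Outcome next() {
    if (!schemaSent) {
      // Steps 1-4: read one batch, build the schema and width estimate,
      // and send only the schema downstream on the fast path.
      if (readIndex >= source.size()) { return Outcome.NONE; }
      cachedBatch = source.get(readIndex++);
      for (String s : cachedBatch) {
        estimatedWidth = Math.max(estimatedWidth, s.length());
      }
      schemaSent = true;
      return Outcome.SCHEMA_ONLY;
    }
    if (cachedBatch != null) {
      // Step 5: the second call returns the batch cached during sampling.
      cachedBatch = null;
      return Outcome.DATA;
    }
    // Step 6: subsequent calls read fresh batches.
    if (readIndex >= source.size()) { return Outcome.NONE; }
    readIndex++;
    return Outcome.DATA;
  }

  int estimatedWidth() { return estimatedWidth; }
}

public class ScannerSketch {
  public static void main(String[] args) {
    SamplingScanner scanner = new SamplingScanner(java.util.Arrays.asList(
        new String[] {"abc", "abcdefghij"},   // first batch: max width 10
        new String[] {"x"}));                 // second batch
    System.out.println(scanner.next());           // SCHEMA_ONLY
    System.out.println(scanner.estimatedWidth()); // 10
    System.out.println(scanner.next());           // DATA (cached batch)
    System.out.println(scanner.next());           // DATA (fresh read)
    System.out.println(scanner.next());           // NONE
  }
}
```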
    
    An open question is this: if the first batch sees only strings of 10 
characters, say, but the second batch sees strings of 100 characters, is this a 
schema change?
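    One possible policy, purely illustrative (hypothetical code, not Drill's): 
since the declared type (VARCHAR with no precision) is unchanged, treat wider 
strings in a later batch as a statistics update rather than a schema change, 
and simply grow the estimate:

```java
// Illustrative only: widen the sampled estimate when a later batch
// exceeds it, instead of signaling a schema change.
public class WidthPolicy {
  static int updateEstimate(int currentEstimate, int batchMaxWidth) {
    return Math.max(currentEstimate, batchMaxWidth);
  }

  public static void main(String[] args) {
    int estimate = 10;                        // from the sampled first batch
    estimate = updateEstimate(estimate, 100); // second batch is wider
    System.out.println(estimate);             // 100
  }
}
```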

