[GitHub] incubator-trafodion pull request: [Trafodion 762] support HIVE dat...

zellerh Tue, 24 May 2016 09:51:24 -0700

Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/496#discussion_r64430466
  
    --- Diff: core/sql/optimizer/NATable.cpp ---
    @@ -3553,6 +3553,48 @@ NAType* getSQColTypeForHive(const char* hiveType, NAMemory* heap)
       if ( !strcmp(hiveType, "timestamp"))
         return new (heap) SQLTimestamp(TRUE /* allow NULL */ , 6, heap);
     
    +  if ( !strcmp(hiveType, "date"))
    +    return new (heap) SQLDate(TRUE /* allow NULL */ , heap);
    +
    +  if ( !strncmp(hiveType, "varchar", 7) )
    +  {
    +    char maxLen[32];
    +    memset(maxLen, 0, 32);
    +    int i=0,j=0;
    +    int copyit = 0;
    +
    +    //get length
    +    for(i = 0; i < strlen(hiveType) ; i++)
    +    {
    +      if(hiveType[i] == '(') //start
    +      {
    +        copyit=1;
    +        continue;
    +      }
    +      else if(hiveType[i] == ')') //stop
    +        break; 
    +      if(copyit > 0)
    +      {
    +        maxLen[j] = hiveType[i];
    +        j++;
    +      }
    +    }
    +    Int32 len = atoi(maxLen);
    +
    +    if(len == 0) return NULL;  //cannot parse correctly
    +
    +    NAString hiveCharset =
    +        ActiveSchemaDB()->getDefaults().getValue(HIVE_DEFAULT_CHARSET);
    +
    +    return new (heap) SQLVarChar(CharLenInfo((hiveCharset == CharInfo::UTF8 ? 0 : len),len),
    --- End diff --
    
    This creates a type VARCHAR(n BYTES) CHARACTER SET UTF8 if
    HIVE_DEFAULT_CHARSET is UTF8. Hive, however, does not have this
    optimization, and it will allow n *characters* in the column. I hate to
    waste so much space, but I think we need to multiply the second "len" by
    the maximum number of bytes per character.
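    
    To make the space cost concrete, here is a minimal standalone sketch (not
    Trafodion code). The helper names utf8CharCount and worstCaseUtf8Bytes are
    hypothetical, and the 4-byte maximum is the UTF-8 limit from RFC 3629,
    which I am assuming matches what the engine reports for UTF8:
    
    ```C++
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Hypothetical helper: count UTF-8 characters (code points) by skipping
    // continuation bytes, which always have the bit pattern 10xxxxxx.
    std::size_t utf8CharCount(const std::string& s)
    {
      std::size_t n = 0;
      for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
          ++n;
      return n;
    }

    // Hypothetical helper: worst-case storage for maxChars characters.
    // Assumption: at most 4 bytes per UTF-8 character (RFC 3629).
    std::size_t worstCaseUtf8Bytes(std::size_t maxChars)
    {
      const std::size_t kMaxBytesPerUtf8Char = 4;
      return maxChars * kMaxBytesPerUtf8Char;
    }

    // Usage: three euro signs are 3 characters but 9 bytes, so a column
    // limited to 3 bytes could not hold a value that Hive VARCHAR(3) accepts.
    void utf8BudgetExample()
    {
      const std::string euros = "\xE2\x82\xAC\xE2\x82\xAC\xE2\x82\xAC";
      assert(utf8CharCount(euros) == 3);
      assert(euros.size() == 9);
      assert(worstCaseUtf8Bytes(3) >= euros.size());
    }
    ```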
    
    I also just noticed that this code compares the string hiveCharset with
    CharInfo::UTF8, which is a number. You copied this code from the "string"
    case, which may be wrong as well. We should convert the string to upper
    case and compare it to SQLCHARSETSTRING_UTF8. It probably works by chance:
    the comparison is always FALSE, so we always pass "len" to the CharLenInfo
    constructor, but that value is irrelevant for anything other than UTF8.
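    
    As an illustration of the comparison I have in mind, here is a standalone
    sketch of a case-insensitive name-to-enum lookup. Everything in it is a
    stand-in: CharSetId, toUpperAscii and charSetFromName are hypothetical
    names, and the literals "UTF8" and "ISO88591" are assumed spellings, not
    values taken from the Trafodion headers:
    
    ```C++
    #include <cassert>
    #include <cctype>
    #include <string>

    // Hypothetical stand-in for an engine charset enum.
    enum class CharSetId { Unknown, Iso88591, Utf8 };

    // Upper-case an ASCII name; the cast avoids undefined behavior for
    // negative char values passed to std::toupper.
    std::string toUpperAscii(std::string s)
    {
      for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
      return s;
    }

    // Map a charset name to the enum by comparing strings with strings,
    // never a string with an enum value.
    CharSetId charSetFromName(const std::string& name)
    {
      const std::string upper = toUpperAscii(name);
      if (upper == "UTF8")     return CharSetId::Utf8;      // assumed spelling
      if (upper == "ISO88591") return CharSetId::Iso88591;  // assumed spelling
      return CharSetId::Unknown;
    }

    // Usage: the lookup does not depend on how the default was typed.
    void charsetLookupExample()
    {
      assert(charSetFromName("utf8") == CharSetId::Utf8);
      assert(charSetFromName("Utf8") == CharSetId::Utf8);
      assert(charSetFromName("latin9") == CharSetId::Unknown);
    }
    ```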
    
    Here is my suggested code:
    
    ```C++
    hiveCharset.toUpper();
    CharInfo::CharSet hiveCharsetEnum = CharInfo::getCharSetEnum(hiveCharset);
    Int32 maxNumChars = 0;
    Int32 storageLen = len;
    ```
    
    The following applies to the VARCHAR case only, but the code above and
    below should also be changed for the STRING case:
    
    ```C++
    if (CharInfo::isVariableWidthMultiByteCharSet(hiveCharsetEnum))
      {
        // For Hive VARCHARs, the number specified is the max. number of
        // characters, while we count in bytes when using
        // HIVE_MAX_STRING_LENGTH for Hive STRING columns. Set the max
        // character constraint and also adjust the required storage length.
        maxNumChars = len;
        storageLen = len * CharInfo::maxBytesPerChar(hiveCharsetEnum);
      }
    ```
    
    Again, this applies to both cases:
    
    ```C++
    return new (heap) SQLVarChar(CharLenInfo(maxNumChars, storageLen),
                                 ...
    ```
    
    Note that we are now treating STRING and VARCHAR differently. Let's say
    HIVE_MAX_STRING_LENGTH is set to 32.
    
    STRING becomes VARCHAR(32 BYTES) CHARACTER SET UTF8.
    
    VARCHAR(32) becomes VARCHAR(32) CHARACTER SET UTF8.
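    
    The two mappings above can be sketched in isolation as follows. This is
    only a model of the suggested logic, not the actual NATable.cpp change:
    LenInfo stands in for CharLenInfo, HiveKind and hiveLenInfo are
    hypothetical names, and the caller supplies the charset properties that
    the real code would look up through CharInfo:
    
    ```C++
    #include <cassert>

    // Hypothetical stand-in for CharLenInfo: a character limit (0 meaning
    // "no separate character limit") plus the storage length in bytes.
    struct LenInfo
    {
      int maxNumChars;
      int storageBytes;
    };

    enum class HiveKind { String, VarChar };  // hypothetical

    // Compute the length pair for a Hive STRING or VARCHAR(len) column.
    // variableWidthMultiByte and maxBytesPerChar describe the charset.
    LenInfo hiveLenInfo(HiveKind kind, int len,
                        bool variableWidthMultiByte, int maxBytesPerChar)
    {
      assert(len > 0 && maxBytesPerChar > 0);

      LenInfo info{0, len};  // default: count in bytes only
      if (kind == HiveKind::VarChar && variableWidthMultiByte)
        {
          // Hive VARCHAR(n) limits characters, so keep n as the character
          // limit and reserve worst-case storage for n characters.
          info.maxNumChars = len;
          info.storageBytes = len * maxBytesPerChar;
        }
      return info;
    }
    ```
    
    With a length of 32 and an assumed maximum of 4 bytes per UTF8 character,
    the STRING case yields (0, 32) and the VARCHAR case yields (32, 128),
    which is the storage cost I was complaining about above.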
    
    Is that OK, or do we still need more CQDs or a different default behavior?



