Github user zellerh commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/496#discussion_r64430466
--- Diff: core/sql/optimizer/NATable.cpp ---
@@ -3553,6 +3553,48 @@ NAType* getSQColTypeForHive(const char* hiveType,
NAMemory* heap)
if ( !strcmp(hiveType, "timestamp"))
return new (heap) SQLTimestamp(TRUE /* allow NULL */ , 6, heap);
+ if ( !strcmp(hiveType, "date"))
+ return new (heap) SQLDate(TRUE /* allow NULL */ , heap);
+
+ if ( !strncmp(hiveType, "varchar", 7) )
+ {
+ char maxLen[32];
+ memset(maxLen, 0, 32);
+ int i=0,j=0;
+ int copyit = 0;
+
+ //get length
+ for(i = 0; i < strlen(hiveType) ; i++)
+ {
+ if(hiveType[i] == '(') //start
+ {
+ copyit=1;
+ continue;
+ }
+ else if(hiveType[i] == ')') //stop
+ break;
+ if(copyit > 0)
+ {
+ maxLen[j] = hiveType[i];
+ j++;
+ }
+ }
+ Int32 len = atoi(maxLen);
+
+ if(len == 0) return NULL; //cannot parse correctly
+
+ NAString hiveCharset =
+ ActiveSchemaDB()->getDefaults().getValue(HIVE_DEFAULT_CHARSET);
+
+ return new (heap) SQLVarChar(CharLenInfo((hiveCharset ==
CharInfo::UTF8 ? 0 : len),len),
--- End diff --
This creates a type VARCHAR(n BYTES) CHARACTER SET UTF8, if
HIVE_DEFAULT_CHARSET is UTF8. Hive, however, does not have this optimization,
and it will allow n *characters* in the column. I hate to waste so much space,
but I think we need to multiply the second "len" by the max length in bytes of
a character.
I just also note that this code compares the string hiveCharset with
CharInfo::UTF8, which is a number. You copied this code from the "string" case,
which also may be wrong. We should be converting this to upper case and
comparing it to SQLCHARSETSTRING_UTF8. It probably works by chance, the
comparison is always FALSE, so we always pass "len" to the CharLenInfo
constructor, but this is irrelevant for anything other than UTF8.
Here is my suggested code:
```C++
hiveCharset.toUpper();
CharInfo::CharSet hiveCharsetEnum = CharInfo::getCharSetEnum(hiveCharset);
Int32 maxNumChars = 0;
Int32 storageLen = len;
```
The following applies to the VARCHAR case only, but the code above and
below should also be changed for the STRING case:
```C++
if (CharInfo::isVariableWidthMultiByteCharSet(hiveCharsetEnum))
{
// For Hive VARCHARs, the number specified is the max. number of
characters,
// while we count in bytes when using HIVE_MAX_STRING_LENGTH for Hive
STRING
// columns. Set the max character constraint and also adjust the
required storage length.
maxNumChars = len;
storageLen = len * CharInfo::maxBytesPerChar(hiveCharsetEnum);
}
```
Again, this applies to both cases:
```C++
return new (heap) SQLVarChar(CharLenInfo(maxNumChars, storageLen),
...
```
Note that we are now treating STRING and VARCHAR differently. Let's say
HIVE_MAX_STRING_LENGTH is set to 32.
STRING becomes VARCHAR(32 BYTES) CHARACTER SET UTF8.
VARCHAR(32) becomes VARCHAR(32) CHARACTER SET UTF8.
Is that ok or do we need still more CQDs or a different default behavior?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---