Github user traflm commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/441#discussion_r60336260
--- Diff: core/sql/executor/ExHdfsScan.cpp ---
@@ -1378,9 +1423,27 @@ ExWorkProcRetcode ExHdfsScanTcb::work()
return WORK_OK;
}
+void ExHdfsScanTcb::setR2ColumnNull(Int32 colidx)
+{
+ Lng32 neededColIndex = 0;
+ Attributes * attr = NULL;
+ ExpTupleDesc * asciiSourceTD = hdfsScanTdb().workCriDesc_->getTupleDescriptor(hdfsScanTdb().asciiTuppIndex_);
+ for (Lng32 i = 0; i < hdfsScanTdb().convertSkipListSize_; i++)
+ {
+ if (hdfsScanTdb().convertSkipList_[i] > 0)
+ {
+ attr = asciiSourceTD->getAttr(neededColIndex);
+ neededColIndex++;
+ if(attr->getOffset() == colidx)
+ {
+ *(short *)&hdfsAsciiSourceData_[attr->getNullIndOffset()] = -1;
+ }
+ }
+ }
+}
--- End diff --
Thanks Dave,
Please refer to the excellent comments above the source code of
FileScan::codeGenForHive().
hdfsAsciiSourceData_ is the pointer to row 'R2' referred to in those
comments. It is in a special internal tuple format called 'exploded format
row'. As far as I know, each field in this tuple contains at least two parts.
The first byte is the null indicator: 0 means not null, and 0xFF (-1) means
null. The null indicator is followed by 8 bytes of value, which is not the real
value but a pointer into R1 that has to be dereferenced when the value is
needed. R1 is a buffer containing the raw data read from the HDFS file. R3 is
for the predicate, and R4 is for the projection and the returned rows.
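To make the field layout described above concrete, here is a minimal sketch, not the actual Trafodion code. FieldAttr, setNullIndicator, isNull, setValuePtr and getValuePtr are hypothetical names made up for illustration. The sketch assumes a 2-byte null indicator, because the diff writes it through a short cast, while the description above speaks of a single byte; the real width comes from the tuple descriptor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the per-column attributes (not the real Attributes class).
struct FieldAttr {
    std::size_t nullIndOffset; // where the null indicator lives in R2
    std::size_t offset;        // where the pointer into R1 lives in R2
};

// Write the null indicator: -1 (all bits set) means null, 0 means not null.
// Assumption: a 2-byte indicator, mirroring the short cast in the diff.
inline void setNullIndicator(std::vector<char>& r2, const FieldAttr& a, bool null) {
    assert(a.nullIndOffset + sizeof(std::int16_t) <= r2.size());
    const std::int16_t ind = null ? -1 : 0;
    std::memcpy(&r2[a.nullIndOffset], &ind, sizeof ind);
}

// Read the null indicator back; any non-zero value is treated as null here.
inline bool isNull(const std::vector<char>& r2, const FieldAttr& a) {
    std::int16_t ind = 0;
    std::memcpy(&ind, &r2[a.nullIndOffset], sizeof ind);
    return ind != 0;
}

// Store a pointer into the raw buffer (R1) in the value part of the field.
inline void setValuePtr(std::vector<char>& r2, const FieldAttr& a, const char* p) {
    assert(a.offset + sizeof p <= r2.size());
    std::memcpy(&r2[a.offset], &p, sizeof p);
}

// Load the pointer back; the caller dereferences it to reach the raw text.
inline const char* getValuePtr(const std::vector<char>& r2, const FieldAttr& a) {
    const char* p = nullptr;
    std::memcpy(&p, &r2[a.offset], sizeof p);
    return p;
}
```

The sketch uses memcpy instead of a pointer cast only to avoid alignment issues in a standalone example; the diff itself writes the indicator through a cast.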
So setting 0xFF here marks the required field in R2 as null. In this case,
that is because the field contains invalid data, like 'abc' for an Integer
field, and the moveColsConvertExpr expression failed to convert it into an
integer. The logic then comes back here, calls this method to set the invalid
field to null, and tries again. Because the evaluation is currently done row by
row, not column by column, this is the simplest way to change the logic at
present. There are better ways, but without more guidance I cannot do those.
For example, doing the conversion column by column would require a huge change
in the compiler to generate different pCode and a different expression tree,
which is beyond my knowledge.
I didn't change any of the current normal logic.
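The retry flow can be sketched roughly as follows. This is only an illustration of the idea, not the executor code: SourceRow, TargetRow, convertRow and convertWithNullOnError are hypothetical names, an empty optional stands in for a null field in R2, and integer parsing with std::from_chars (C++17) stands in for the real convert expression.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

using SourceRow = std::vector<std::optional<std::string>>; // stand-in for R2
using TargetRow = std::vector<std::optional<long>>;        // converted row

// Convert the whole row in one pass. Returns the index of the first column
// whose text is not a valid integer, or -1 if every column converted.
inline int convertRow(const SourceRow& src, TargetRow& out) {
    out.assign(src.size(), std::nullopt);
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (!src[i]) continue;                 // null in, null out
        const std::string& s = *src[i];
        const char* last = s.data() + s.size();
        long v = 0;
        auto res = std::from_chars(s.data(), last, v);
        if (res.ec != std::errc() || res.ptr != last)
            return static_cast<int>(i);        // report the failing column
        out[i] = v;
    }
    return -1;
}

// Row-level retry: on failure, null the offending field and evaluate the
// whole row again. Each retry nulls one more field, so the loop terminates.
inline TargetRow convertWithNullOnError(SourceRow src) {
    TargetRow out;
    int bad = convertRow(src, out);
    while (bad >= 0) {
        src[static_cast<std::size_t>(bad)].reset(); // plays the role of setR2ColumnNull()
        bad = convertRow(src, out);
    }
    return out;
}
```

For a row holding "1", "abc", "3", the first pass fails on the second column, that field is nulled, and the second pass succeeds with a null in that position. The cost is one extra pass over the row per invalid field, which is the trade-off of keeping the evaluation row by row.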
The move expression typically generates a sequence of pCode instructions.
If the row has 3 columns, the generated pCode will have 3 sets of instructions
like this (the instruction names are not real, they only illustrate what I
mean):
IFNULL
INST1
ELSE
INST2
I modified INST2 to save the offset of the invalid data in 'R2', and
everything else stays the same.
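As a rough model of that per-column pattern, the sketch below walks one "instruction set" per column and records the R2 offset when the conversion branch fails. All names here (ColumnStep, OutField, EvalResult, evalRow) are hypothetical; real pCode is a generated instruction stream and looks nothing like this loop. strtol stands in for the conversion, and overflow is not checked.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical description of one column's instruction set.
struct ColumnStep {
    long r2Offset;     // offset of this column's field in R2
    bool isNull;       // outcome of the IFNULL test
    std::string text;  // raw text the field points to
};

struct OutField {
    bool isNull;
    long value;
};

struct EvalResult {
    bool ok = true;
    long badOffset = -1; // R2 offset saved by the failing conversion step
};

// Evaluate all columns of one row. Stops at the first conversion failure
// and reports which R2 offset held the invalid data.
inline EvalResult evalRow(const std::vector<ColumnStep>& steps,
                          std::vector<OutField>& out) {
    EvalResult res;
    out.clear();
    for (const ColumnStep& s : steps) {
        if (s.isNull) {                       // IFNULL -> INST1: move a null
            out.push_back(OutField{true, 0});
            continue;
        }
        char* end = nullptr;                  // ELSE -> INST2: convert
        const long v = std::strtol(s.text.c_str(), &end, 10);
        if (end == s.text.c_str() || *end != '\0') {
            res.ok = false;
            res.badOffset = s.r2Offset;       // the added bookkeeping
            return res;
        }
        out.push_back(OutField{false, v});
    }
    return res;
}
```

The caller can then match badOffset against each attribute's offset, as setR2ColumnNull() does in the diff, to find the field whose null indicator should be set before retrying.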
If pCode is turned off, the expression is evaluated in the normal clause
evaluation path. It is a list of clauses: each clause corresponds to an
operation, and each column has a convert clause to do the data conversion if
needed. This is generated inside the compiler and I don't know the details,
but the parser should generate a tree of ItemExpr, and gencode will generate
the correct expression.