Github user traflm commented on a diff in the pull request:

    https://github.com/apache/incubator-trafodion/pull/441#discussion_r60336260
  
    --- Diff: core/sql/executor/ExHdfsScan.cpp ---
    @@ -1378,9 +1423,27 @@ ExWorkProcRetcode ExHdfsScanTcb::work()
       
       return WORK_OK;
     }
    +void ExHdfsScanTcb::setR2ColumnNull(Int32 colidx)
    +{
    +  Lng32 neededColIndex = 0;
    +  Attributes * attr = NULL;
    +  ExpTupleDesc * asciiSourceTD = 
hdfsScanTdb().workCriDesc_->getTupleDescriptor(hdfsScanTdb().asciiTuppIndex_);
    +  for (Lng32 i = 0; i <  hdfsScanTdb().convertSkipListSize_; i++)
    +  {
    +    if (hdfsScanTdb().convertSkipList_[i] > 0)
    +    {
    +      attr = asciiSourceTD->getAttr(neededColIndex);
    +      neededColIndex++;
     +      if (attr->getOffset() == colidx)
     +      {
     +        *(short *)&hdfsAsciiSourceData_[attr->getNullIndOffset()] = -1;
     +      }
    +    }
    +  }
    +}
    --- End diff --
    
    Thanks, Dave.
    Please refer to the excellent comments above the source code in 
FileScan::codeGenForHive(). 
    
    The hdfsAsciiSourceData_ is the pointer to the row 'R2' referred to in those 
comments. It is in a special internal tuple layout called the 'exploded format 
row'. As far as I know, each field in this tuple has at least two parts: the 
first byte is a null indicator, where 0 means not null and 0xFF (-1) means null. 
The null indicator is followed by 8 bytes of value, but this is not the real 
value, just a pointer into R1; when the value is needed, the pointer must be 
dereferenced. R1 is a buffer containing the raw data read from the HDFS file; 
R3 is for predicates, and R4 is for projection and returned rows.
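    The R2 layout described above could be sketched like this. This is a 
standalone illustrative model, not Trafodion code; `R2Field`, `setFieldNull`, 
and `fieldText` are made-up names. Note the null indicator is modeled as a 
2-byte short because that is what the diff writes (`*(short *)&...`), even 
though the description above says one byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy model of one field in the 'exploded format' R2 row: a null
// indicator (0 = not null, -1 = null) plus a pointer into the raw
// R1 buffer. The field does not own the value; it only points at it.
struct R2Field {
  int16_t nullInd;      // 0 = not null, -1 (0xFFFF) = null
  const char* rawPtr;   // points into R1 (raw HDFS data), not the value itself
  size_t rawLen;        // length of the raw text in R1
};

// Mark a field null, as setR2ColumnNull does for a field whose raw
// text failed conversion (e.g. "abc" for an INTEGER column).
void setFieldNull(R2Field& f) { f.nullInd = -1; }

bool isFieldNull(const R2Field& f) { return f.nullInd != 0; }

// Dereference into R1 only when the field is not null.
std::string fieldText(const R2Field& f) {
  return isFieldNull(f) ? std::string() : std::string(f.rawPtr, f.rawLen);
}
```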
    So writing 0xFF here sets the required field in R2 to null. In this case 
that is done because the field contains invalid data, such as 'abc' for an 
INTEGER column, and the moveColsConvertExpr expression failed to convert it 
into an integer. The logic then comes back here, calls this method to set the 
invalid field to null, and tries again. Because evaluation is currently done 
row by row rather than column by column, this is the simplest way to change 
the logic at present. There are better approaches, for example doing the 
conversion column by column, but that would require a huge change in the 
compiler to generate different pCode and a different expression tree, which is 
beyond my knowledge, so I cannot do it without more guidance.
    I didn't change any of the current normal logic.
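    The null-and-retry flow described above could be sketched like this. It is 
a standalone model under my own assumptions, not Trafodion code; `convertToInt` 
stands in for the per-column work done by moveColsConvertExpr, and 
`convertRowWithRetry` for the work-loop retry that calls setR2ColumnNull.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// One source column in R2: raw text plus a null flag.
struct Column {
  std::string raw;
  bool isNull = false;
};

// Hypothetical per-column conversion: returns false if the raw text is
// not a valid integer (a stand-in for moveColsConvertExpr failing).
bool convertToInt(const Column& c, long& out) {
  if (c.isNull) { out = 0; return true; }      // NULL converts trivially
  if (c.raw.empty()) return false;
  for (char ch : c.raw)
    if (!std::isdigit(static_cast<unsigned char>(ch))) return false;
  out = std::stol(c.raw);
  return true;
}

// Row-at-a-time evaluation with the retry described above: when a column
// fails, null it out (the setR2ColumnNull step) and re-run the whole row.
std::vector<long> convertRowWithRetry(std::vector<Column>& row) {
  for (;;) {
    std::vector<long> out(row.size());
    bool ok = true;
    for (size_t i = 0; i < row.size(); ++i) {
      if (!convertToInt(row[i], out[i])) {
        row[i].isNull = true;   // invalid data becomes NULL
        ok = false;
        break;                  // retry the row from the top
      }
    }
    if (ok) return out;
  }
}
```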
    
    The move expression typically generates a piece of pCode. If the row has 3 
columns, the generated pCode will have 3 sets of instructions like this (the 
instruction names are not real, they just illustrate what I mean):
    IFNULL
      INST1
    ELSE
      INST2
    I modified INST2 to save the offset of the invalid data in 'R2'; everything 
else stays the same.
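    The per-column IFNULL/ELSE pattern and the modified INST2 could be modeled 
roughly like this. The names (`Col`, `evalMoveExpr`, `looksLikeInt`) are made 
up for illustration, as are the instruction semantics; only the idea of 
recording the failing column's R2 offset comes from the change above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of the per-column pCode pattern described above:
// IFNULL -> INST1 (propagate NULL), ELSE -> INST2 (convert).
struct Col { bool isNull; std::string raw; size_t offsetInR2; };

bool looksLikeInt(const std::string& s) {
  if (s.empty()) return false;
  for (char c : s) if (c < '0' || c > '9') return false;
  return true;
}

// Returns true on success; on a conversion failure the modified "INST2"
// records the R2 offset of the bad column in failedOffset, so the caller
// knows which column to null out before retrying.
bool evalMoveExpr(const std::vector<Col>& cols, size_t& failedOffset) {
  for (const Col& c : cols) {
    if (c.isNull) {
      // IFNULL branch (INST1): propagate the NULL, nothing to convert.
      continue;
    }
    // ELSE branch (INST2): convert; on failure, save the offset.
    if (!looksLikeInt(c.raw)) {
      failedOffset = c.offsetInR2;   // the modification described above
      return false;
    }
  }
  return true;
}
```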
    If pCode is turned off, the expression is evaluated through the normal 
clause-evaluation path instead. The expression is a list of clauses, each 
clause corresponding to one operation, and each column has a convert clause to 
do the data conversion if needed. This is generated inside the compiler; I 
don't know all the details, but the parser should generate a tree of ItemExpr 
nodes, and code generation then produces the correct expression.
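    The clause-list fallback could be sketched like this. This is only a 
conceptual model of "a list of clauses, each one operation, evaluated in 
order"; `Clause`, `makeConvertClause`, and `evalClauses` are invented names, 
not the real ex_clause machinery.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Sketch of the non-pCode path: the expression is a list of clauses,
// each clause one operation; convert clauses do per-column conversion.
struct EvalState {
  std::vector<std::string> raw;  // raw column text (from R2/R1)
  std::vector<long> out;         // converted values
};

using Clause = std::function<bool(EvalState&)>;  // false = clause failed

// Hypothetical factory for a convert clause on column i: succeeds only
// when the raw text is a plain non-empty digit string.
Clause makeConvertClause(size_t i) {
  return [i](EvalState& s) {
    if (s.raw[i].empty()) return false;
    for (char c : s.raw[i]) if (c < '0' || c > '9') return false;
    s.out[i] = std::stol(s.raw[i]);
    return true;
  };
}

// Evaluate the clause list in order, stopping at the first failure.
bool evalClauses(const std::vector<Clause>& clauses, EvalState& s) {
  for (const Clause& c : clauses)
    if (!c(s)) return false;
  return true;
}
```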

