[jira] [Commented] (TRAFODION-2917) Refactor Trafodion implementation of hdfs scan for text formatted hive tables

ASF GitHub Bot (JIRA) Sat, 27 Jan 2018 11:49:56 -0800

    [ 
https://issues.apache.org/jira/browse/TRAFODION-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342285#comment-16342285
 ]


ASF GitHub Bot commented on TRAFODION-2917:
-------------------------------------------

Github user sureshsubbiah commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1417#discussion_r164279712
  
    --- Diff: core/sql/executor/HdfsClient_JNI.cpp ---
    @@ -0,0 +1,452 @@
    +//**********************************************************************
    +// @@@ START COPYRIGHT @@@
    +//
    +// Licensed to the Apache Software Foundation (ASF) under one
    +// or more contributor license agreements.  See the NOTICE file
    +// distributed with this work for additional information
    +// regarding copyright ownership.  The ASF licenses this file
    +// to you under the Apache License, Version 2.0 (the
    +// "License"); you may not use this file except in compliance
    +// with the License.  You may obtain a copy of the License at
    +//
    +//   http://www.apache.org/licenses/LICENSE-2.0
    +//
    +// Unless required by applicable law or agreed to in writing,
    +// software distributed under the License is distributed on an
    +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +// KIND, either express or implied.  See the License for the
    +// specific language governing permissions and limitations
    +// under the License.
    +//
    +// @@@ END COPYRIGHT @@@
    +// **********************************************************************
    +
    +#include "QRLogger.h"
    +#include "Globals.h"
    +#include "jni.h"
    +#include "HdfsClient_JNI.h"
    +
    +// ===========================================================================
    +// ===== Class HdfsScan
    +// ===========================================================================
    +
    +JavaMethodInit* HdfsScan::JavaMethods_ = NULL;
    +jclass HdfsScan::javaClass_ = 0;
    +bool HdfsScan::javaMethodsInitialized_ = false;
    +pthread_mutex_t HdfsScan::javaMethodsInitMutex_ = PTHREAD_MUTEX_INITIALIZER;
    +
    +static const char* const hdfsScanErrorEnumStr[] = 
    +{
    +};
    +
    +
    +//////////////////////////////////////////////////////////////////////////////
    +// 
    +//////////////////////////////////////////////////////////////////////////////
    +HDFS_Scan_RetCode HdfsScan::init()
    +{
    +  static char className[]="org/trafodion/sql/HdfsScan";
    +  HDFS_Scan_RetCode rc; 
    +
    +  if (javaMethodsInitialized_)
    +    return (HDFS_Scan_RetCode)JavaObjectInterface::init(className, javaClass_, JavaMethods_, (Int32)JM_LAST, javaMethodsInitialized_);
    +  else
    +  {
    +    pthread_mutex_lock(&javaMethodsInitMutex_);
    +    if (javaMethodsInitialized_)
    +    {
    +      pthread_mutex_unlock(&javaMethodsInitMutex_);
    +      return (HDFS_Scan_RetCode)JavaObjectInterface::init(className, javaClass_, JavaMethods_, (Int32)JM_LAST, javaMethodsInitialized_);
    +    }
    +    JavaMethods_ = new JavaMethodInit[JM_LAST];
    +    
    +    JavaMethods_[JM_CTOR      ].jm_name      = "<init>";
    +    JavaMethods_[JM_CTOR      ].jm_signature = "()V";
    +    JavaMethods_[JM_INIT_SCAN_RANGES].jm_name      = "<init>";
    +    JavaMethods_[JM_INIT_SCAN_RANGES].jm_signature = "(Ljava/lang/Object;Ljava/lang/Object;[Ljava/lang/String;[J[J)V";
    +    JavaMethods_[JM_TRAF_HDFS_READ].jm_name      = "trafHdfsRead";
    +    JavaMethods_[JM_TRAF_HDFS_READ].jm_signature = "()[I";
    +   
    +    rc = (HDFS_Scan_RetCode)JavaObjectInterface::init(className, javaClass_, JavaMethods_, (Int32)JM_LAST, javaMethodsInitialized_);
    +    javaMethodsInitialized_ = TRUE;
    +    pthread_mutex_unlock(&javaMethodsInitMutex_);
    +  }
    +  return rc;
    +}
    +        
    +char* HdfsScan::getErrorText(HDFS_Scan_RetCode errEnum)
    +{
    +  if (errEnum < (HDFS_Scan_RetCode)JOI_LAST)
    +    return JavaObjectInterface::getErrorText((JOI_RetCode)errEnum);
    +  else
    +    return (char*)hdfsScanErrorEnumStr[errEnum-HDFS_SCAN_FIRST-1];
    --- End diff --
    
    I wonder why there is a "-1" here but no equivalent "-1" in HdfsClient::getErrorText().


> Refactor Trafodion implementation of hdfs scan for text formatted hive tables
> -----------------------------------------------------------------------------
>
>                 Key: TRAFODION-2917
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2917
>             Project: Apache Trafodion
>          Issue Type: New Feature
>          Components: sql-general
>            Reporter: Selvaganesan Govindarajan
>            Priority: Major
>             Fix For: 2.3
>
>
> Find below the general outline of hdfs scan for text formatted hive tables.
> The compiler returns a list of scan ranges, along with the begin range and the 
> number of ranges to be processed by each TCB instance in the TDB. This list of 
> scan ranges is also re-computed at run time, possibly based on a CQD.
> The scan ranges for a TCB can come from the same or different hdfs files. The 
> TCB creates two threads to read these ranges. Two ranges (for the TCB) are 
> initially assigned to these threads. As and when a range is completed, the 
> next range (assigned for the TCB) is picked up by the thread. Ranges are read 
> in multiples of the hdfs scan buffer size at the TCB level. The default hdfs 
> scan buffer size is 64 MB. Rows from the hdfs scan buffer are processed and 
> moved into the up queue. If the range contains a record split, then the range 
> is extended to read up to the range tail IO size to get the full row. The 
> range that holds the latter part of the row ignores it because the former 
> range processes it. 
> Record split at the file level is not possible and/or not supported.
>  For compression, the compiler returns the range info such that the hdfs scan 
> buffer can hold the full uncompressed buffer.
>  Cons:
> The reader threads feature is too complex to maintain in C++.
> Error handling at the layer below the TCB is missing, or errors are not 
> propagated to the work method, causing incorrect results.
> Data is possibly copied multiple times.
> Libhdfs calls are not optimized. It was observed that the method Ids are 
> being obtained many times. Need to check if this problem still exists.
> Now that we clearly know what is expected, it could be optimized better:
>   - Reduced scan buffer size for smoother data flow
>   - Better thread utilization
>   - Avoid multiple copying of data.
> Unable to comprehend the need for two threads for pre-fetch, especially when 
> one range is completed fully before the data from the next range is processed.
>  The following hdfs calls are used by programs in the exp and executor 
> directories.
>                  U hdfsCloseFile
>                  U hdfsConnect
>                  U hdfsDelete
>                  U hdfsExists
>                  U hdfsFlush
>                  U hdfsFreeFileInfo
>                  U hdfsGetPathInfo
>                  U hdfsListDirectory
>                  U hdfsOpenFile
>                  U hdfsPread
>                  U hdfsRename
>                  U hdfsWrite
>                  U hdfsCreateDirectory
>  New implementation
>  Make changes to use direct Java APIs for these calls. However, come up with 
> a better mechanism to move the data between Java and JNI, avoid unnecessary 
> copying of data, and manage threads better via Executor concepts in Java. 
> Hence it won’t be a direct mapping of these calls to the hdfs Java API. 
> Instead, use an abstraction like the one used for HBase access.
>  I believe the newer implementation will be optimized better and hence will 
> perform better (but not by many fold).
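
The record-split rule in the outline above (a range reads past its end to finish a row, and the next range ignores the part of the row it does not own) can be sketched as follows. This is an illustration only, not the Trafodion implementation: rowsForRange is a hypothetical helper, the whole file is assumed to be in memory as a string, rows are assumed to be newline-terminated, and the limit imposed by the range tail IO size is not modeled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not Trafodion code: returns the rows owned by the
// byte range [begin, end) of `data`. A row is owned by the range that
// contains its first byte.
std::vector<std::string> rowsForRange(const std::string& data,
                                      std::size_t begin, std::size_t end) {
  std::vector<std::string> rows;
  std::size_t rowStart = 0;
  if (begin > 0) {
    // Skip the tail of a row that started in an earlier range. Searching
    // from begin - 1 keeps a row that starts exactly at `begin`.
    std::size_t nl = data.find('\n', begin - 1);
    if (nl == std::string::npos) return rows;
    rowStart = nl + 1;
  }
  while (rowStart < end && rowStart < data.size()) {
    // Read past `end` if needed so that a split row is returned whole.
    std::size_t nl = data.find('\n', rowStart);
    if (nl == std::string::npos) nl = data.size();
    rows.push_back(data.substr(rowStart, nl - rowStart));
    rowStart = nl + 1;
  }
  return rows;
}
```

For the input "aa\nbbbb\ncc\n" split at byte 5, the range [0, 5) yields the rows "aa" and "bbbb", and the range [5, 11) yields only "cc", so every row is produced exactly once.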



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
