[
https://issues.apache.org/jira/browse/TRAFODION-2917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16342285#comment-16342285
]
ASF GitHub Bot commented on TRAFODION-2917:
-------------------------------------------
Github user sureshsubbiah commented on a diff in the pull request:
https://github.com/apache/trafodion/pull/1417#discussion_r164279712
--- Diff: core/sql/executor/HdfsClient_JNI.cpp ---
@@ -0,0 +1,452 @@
+//**********************************************************************
+// @@@ START COPYRIGHT @@@
+//
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+//
+// @@@ END COPYRIGHT @@@
+// **********************************************************************
+
+#include "QRLogger.h"
+#include "Globals.h"
+#include "jni.h"
+#include "HdfsClient_JNI.h"
+
+// ===========================================================================
+// ===== Class HdfsScan
+// ===========================================================================
+
+JavaMethodInit* HdfsScan::JavaMethods_ = NULL;
+jclass HdfsScan::javaClass_ = 0;
+bool HdfsScan::javaMethodsInitialized_ = false;
+pthread_mutex_t HdfsScan::javaMethodsInitMutex_ = PTHREAD_MUTEX_INITIALIZER;
+
+static const char* const hdfsScanErrorEnumStr[] =
+{
+};
+
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//////////////////////////////////////////////////////////////////////////////
+HDFS_Scan_RetCode HdfsScan::init()
+{
+ static char className[]="org/trafodion/sql/HdfsScan";
+ HDFS_Scan_RetCode rc;
+
+ if (javaMethodsInitialized_)
+ return (HDFS_Scan_RetCode)JavaObjectInterface::init(className, javaClass_, JavaMethods_, (Int32)JM_LAST, javaMethodsInitialized_);
+ else
+ {
+ pthread_mutex_lock(&javaMethodsInitMutex_);
+ if (javaMethodsInitialized_)
+ {
+ pthread_mutex_unlock(&javaMethodsInitMutex_);
+ return (HDFS_Scan_RetCode)JavaObjectInterface::init(className, javaClass_, JavaMethods_, (Int32)JM_LAST, javaMethodsInitialized_);
+ }
+ JavaMethods_ = new JavaMethodInit[JM_LAST];
+
+ JavaMethods_[JM_CTOR ].jm_name = "<init>";
+ JavaMethods_[JM_CTOR ].jm_signature = "()V";
+ JavaMethods_[JM_INIT_SCAN_RANGES].jm_name = "<init>";
+ JavaMethods_[JM_INIT_SCAN_RANGES].jm_signature = "(Ljava/lang/Object;Ljava/lang/Object;[Ljava/lang/String;[J[J)V";
+ JavaMethods_[JM_TRAF_HDFS_READ].jm_name = "trafHdfsRead";
+ JavaMethods_[JM_TRAF_HDFS_READ].jm_signature = "()[I";
+
+ rc = (HDFS_Scan_RetCode)JavaObjectInterface::init(className, javaClass_, JavaMethods_, (Int32)JM_LAST, javaMethodsInitialized_);
+ javaMethodsInitialized_ = TRUE;
+ pthread_mutex_unlock(&javaMethodsInitMutex_);
+ }
+ return rc;
+}
+
+char* HdfsScan::getErrorText(HDFS_Scan_RetCode errEnum)
+{
+ if (errEnum < (HDFS_Scan_RetCode)JOI_LAST)
+ return JavaObjectInterface::getErrorText((JOI_RetCode)errEnum);
+ else
+ return (char*)hdfsScanErrorEnumStr[errEnum-HDFS_SCAN_FIRST-1];
--- End diff --
I wonder why there is a "-1" here but no equivalent "-1" in HdfsClient::getErrorText().
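The question above comes down to how each error enum is laid out relative to its string table. A minimal sketch of the two possible layouts follows; every enum, value, and string in it is a hypothetical stand-in and not the actual definition from HdfsClient_JNI.h. If the FIRST value is a pure sentinel, the first real error is FIRST + 1 and the lookup needs the "-1". If the first real error is an alias of FIRST, the lookup must not subtract one.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical base-class codes (stand-in for the JOI_* values).
enum BaseCode { JOI_OK = 0, JOI_ERROR_A, JOI_LAST };

// Layout A: FIRST is a sentinel with no message of its own.
enum ScanCode {
  SCAN_FIRST = JOI_LAST,
  SCAN_ERROR_PARAM,   // == SCAN_FIRST + 1, maps to table index 0
  SCAN_ERROR_READ,    // == SCAN_FIRST + 2, maps to table index 1
  SCAN_LAST
};

static const char* const scanErrorStr[] = {
  "scan: bad parameter",
  "scan: read failed"
};

const char* scanErrorText(ScanCode e) {
  // The sentinel itself would index at -1, so it is excluded here.
  assert(e > SCAN_FIRST && e < SCAN_LAST);
  return scanErrorStr[e - SCAN_FIRST - 1];
}

// Layout B: the first real error is an alias of FIRST.
enum ClientCode {
  CLIENT_FIRST = JOI_LAST,
  CLIENT_ERROR_OPEN = CLIENT_FIRST,  // maps to table index 0
  CLIENT_ERROR_CLOSE,                // maps to table index 1
  CLIENT_LAST
};

static const char* const clientErrorStr[] = {
  "client: open failed",
  "client: close failed"
};

const char* clientErrorText(ClientCode e) {
  assert(e >= CLIENT_FIRST && e < CLIENT_LAST);
  return clientErrorStr[e - CLIENT_FIRST];  // no "-1" in this layout
}
```

Either convention works as long as the enum and the lookup agree. Whether the two classes in the pull request really use different layouts can only be settled by checking the enum definitions in the header.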
> Refactor Trafodion implementation of hdfs scan for text formatted hive tables
> -----------------------------------------------------------------------------
>
> Key: TRAFODION-2917
> URL: https://issues.apache.org/jira/browse/TRAFODION-2917
> Project: Apache Trafodion
> Issue Type: New Feature
> Components: sql-general
> Reporter: Selvaganesan Govindarajan
> Priority: Major
> Fix For: 2.3
>
>
> Find below the general outline of hdfs scan for text formatted hive tables.
> The compiler returns a list of scan ranges, along with the begin range and the
> number of ranges to be done by each instance of TCB in TDB. This list of scan
> ranges is also re-computed at run time, possibly based on a CQD.
> The scan ranges for a TCB can come from the same or different hdfs files. The TCB
> creates two threads to read these ranges. Two ranges (for the TCB) are
> initially assigned to these threads. As and when a range is completed, the
> next range (assigned for the TCB) is picked up by the thread. Ranges are read
> in multiples of hdfs scan buffer size at the TCB level. The default hdfs scan
> buffer size is 64 MB. Rows from the hdfs scan buffer are processed and moved into
> the up queue. If the range contains a record split, then the range is extended to
> read up to range tail IO size to get the full row. The range that had the
> latter part of the row ignores it because the former range processes it.
> Record split at the file level is not possible and/or not supported.
> For compression, the compiler returns the range info such that the hdfs scan
> buffer can hold the full uncompressed buffer.
> Cons:
> Reader threads feature too complex to maintain in C++
> Error handling at the layer below the TCB is missing or errors are not
> propagated to work method causing incorrect results
> Possible multiple copying of data
> Libhdfs calls are not optimized. It was observed that the method Ids are
> being obtained many times. Need to check if this problem still exists.
> Now that we clearly know what is expected, it could be optimized better
> - Reduced scan buffer size for smoother data flow
> - Better thread utilization
> - Avoid multiple copying of data.
> Unable to comprehend the need for two threads for pre-fetch especially when
> one range is completed fully before the data from next range is processed.
> The following hdfs calls are used by programs in the exp and executor directories.
> U hdfsCloseFile
> U hdfsConnect
> U hdfsDelete
> U hdfsExists
> U hdfsFlush
> U hdfsFreeFileInfo
> U hdfsGetPathInfo
> U hdfsListDirectory
> U hdfsOpenFile
> U hdfsPread
> U hdfsRename
> U hdfsWrite
> U hdfsCreateDirectory
> New implementation
> Make changes to use direct Java APIs for these calls. However, come up with a
> better mechanism to move the data between Java and JNI, avoid unnecessary
> copying of data, and manage threads better via Executor concepts in Java.
> Hence it won’t be a direct mapping of these calls to the hdfs Java API. Instead,
> use an abstraction like the one used for HBase access.
> I believe the newer implementation will be better optimized and hence perform
> better (though not by many fold).
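The record-split rule in the issue description (a range extends past its end to finish its last row, and the range holding the latter part of that row ignores it) can be sketched as follows. This is an illustration under assumptions, not the Trafodion code: the file is an in-memory string, rows are newline-delimited, `rowsForRange` is a hypothetical name, and the extension is not capped by a range tail IO size as it would be in the executor.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the rows owned by the byte range [offset, offset + len).
// A row is owned by the range that contains its first byte.
std::vector<std::string> rowsForRange(const std::string& file,
                                      std::size_t offset,
                                      std::size_t len) {
  assert(len > 0 && "a scan range must not be empty");
  std::vector<std::string> rows;
  if (offset >= file.size()) return rows;

  const std::size_t end = std::min(offset + len, file.size());
  std::size_t pos = offset;

  // The range starts in the middle of a row: skip the partial row,
  // because the previous range reads it in full.
  if (offset > 0 && file[offset - 1] != '\n') {
    const std::size_t nl = file.find('\n', offset);
    if (nl == std::string::npos) return rows;
    pos = nl + 1;
  }

  // Emit every row that starts inside the range. The terminating
  // newline may lie beyond `end`, which is the "extended" read.
  while (pos < end) {
    std::size_t nl = file.find('\n', pos);
    if (nl == std::string::npos) nl = file.size();
    rows.push_back(file.substr(pos, nl - pos));
    pos = nl + 1;
  }
  return rows;
}
```

With the 12-byte file "aaa\nbbbb\ncc\n" split into ranges [0, 6) and [6, 12), the first range reads past byte 6 to complete "bbbb", and the second range skips the tail of "bbbb" and returns only "cc", so every row is produced exactly once.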
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)