[
https://issues.apache.org/jira/browse/TRAFODION-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609960#comment-15609960
]
ASF GitHub Bot commented on TRAFODION-2127:
-------------------------------------------
Github user DaveBirdsall commented on a diff in the pull request:
https://github.com/apache/incubator-trafodion/pull/772#discussion_r85226755
--- Diff: core/sql/executor/ExHdfsScan.cpp ---
@@ -1700,6 +1711,138 @@ char * ExHdfsScanTcb::extractAndTransformAsciiSourceToSqlRow(int &err,
   return NULL;
 }
+void ExHdfsScanTcb::computeRangesAtRuntime()
+{
+  int numFiles = 0;
+  Int64 totalSize = 0;
+  Int64 myShare = 0;
+  Int64 runningSum = 0;
+  Int64 myStartPositionInBytes = 0;
+  Int64 firstFileStartingOffset = 0;
+  Int64 lastFileBytesToRead = -1;
+  Int32 numParallelInstances = MAXOF(getGlobals()->getNumOfInstances(),1);
+  hdfsFS fs = ((GetCliGlobals()->currContext())->getHdfsServerConnection(
+                    hdfsScanTdb().hostName_,
+                    hdfsScanTdb().port_));
+  hdfsFileInfo *fileInfos = hdfsListDirectory(fs,
+                                              hdfsScanTdb().hdfsRootDir_,
+                                              &numFiles);
+
+  if (runTimeRanges_)
+    deallocateRuntimeRanges();
+
+  // in a first round, count the total number of bytes
+  for (int f=0; f<numFiles; f++)
+    {
+      ex_assert(fileInfos[f].mKind == kObjectKindFile,
+                "subdirectories not supported with runtime HDFS ranges");
+      totalSize += (Int64) fileInfos[f].mSize;
+    }
+
+  // compute my share, in bytes
+  // (the last of the ESPs may read a bit more)
+  myShare = totalSize / numParallelInstances;
+  myStartPositionInBytes = myInstNum_ * myShare;
+  beginRangeNum_ = -1;
+  numRanges_ = 0;
+
+  if (totalSize > 0)
+    {
+      // second round, find out the range of files I need to read
+      for (int g=0; g<numFiles; g++)
+        {
+          Int64 prevSum = runningSum;
+
+          runningSum += (Int64) fileInfos[g].mSize;
+
+          if (runningSum >= myStartPositionInBytes)
+            {
+              if (beginRangeNum_ < 0)
+                {
+                  // I have reached the first file that I need to read
+                  beginRangeNum_ = g;
+                  firstFileStartingOffset =
+                    myStartPositionInBytes - prevSum;
+                }
+
+              numRanges_++;
+
+              if (runningSum > (myStartPositionInBytes + myShare) &&
+                  myInstNum_ < numParallelInstances-1)
+                {
+                  // the next file is beyond the range that I need to read
+                  lastFileBytesToRead =
+                    myStartPositionInBytes + myShare - prevSum;
+                  break;
+                }
+            }
+        }
+
+      // now that we know how many ranges we need, allocate them
+      numRunTimeRanges_ = numRanges_;
+      runTimeRanges_ = new(getHeap()) HdfsFileInfo[numRunTimeRanges_];
+    }
+  else
+    beginRangeNum_ = 0;
+
+  // third round, populate the ranges that this ESP needs to read
+  for (int h=beginRangeNum_; h<beginRangeNum_+numRanges_; h++)
--- End diff --
In this design, two or more ESPs may read a section of the same file. Just
curious: Is there added latency or overhead introduced by doing this? (Old
thinking, I know; I have visions of disk head dances. Main memory files
wouldn't have an issue.) Also, we seem to be byte-oriented in our boundaries
without regard to record boundaries. I take it in some other layer we figure
out where records begin and end, and read across boundaries or after them as
needed?
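For reference, the split arithmetic the diff implements can be sketched stand-alone. This is only an illustration: the names `FileSlice` and `computeSlice` are invented here, not from the Trafodion sources. Each of N instances takes totalSize/N bytes starting at instNum * share, and the last instance absorbs the remainder. On the record-boundary question, the usual convention (as in Hadoop's TextInputFormat; this is an assumption about Trafodion's lower layer, not confirmed by the diff) is that a reader whose start offset falls mid-record skips forward to the first record delimiter and reads past its end offset to finish the last record it started, so purely byte-oriented splits still produce whole records exactly once.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-alone sketch of the byte-range split in
// computeRangesAtRuntime(); names are illustrative, not Trafodion's.
struct FileSlice {
  int64_t start;   // starting byte offset in the concatenated files
  int64_t length;  // number of bytes this instance reads
};

FileSlice computeSlice(int64_t totalSize, int numInstances, int instNum) {
  int64_t share = totalSize / numInstances;   // even share, rounded down
  int64_t start = instNum * share;            // my starting position
  // the last instance reads to the end, picking up the remainder
  int64_t length =
      (instNum == numInstances - 1) ? totalSize - start : share;
  return {start, length};
}
```

With 100 bytes over 3 instances, instances 0 and 1 each get 33 bytes and instance 2 gets 34, mirroring the "last ESP may read a bit more" comment in the diff.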
> enhance Trafodion implementation of WITH clause
> -----------------------------------------------
>
> Key: TRAFODION-2127
> URL: https://issues.apache.org/jira/browse/TRAFODION-2127
> Project: Apache Trafodion
> Issue Type: Improvement
> Reporter: liu ming
> Assignee: Hans Zeller
>
> TRAFODION-1673 described some details about how to support the WITH clause
> in Trafodion.
> As an initial implementation, we use a simple pure-parser method.
> That way, Trafodion can support the WITH clause functionally, but it is not
> good from a performance point of view; the parser also needs to be made
> stricter about the syntax it accepts.
> This JIRA is a follow-up, to track the remaining work to support the
> WITH clause in Trafodion.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)