mustafaiman commented on a change in pull request #1823:
URL: https://github.com/apache/hive/pull/1823#discussion_r563956041



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -4509,7 +4509,7 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
         "Minimum allocation possible from LLAP buddy allocator. Allocations below that are\n" +
         "padded to minimum allocation. For ORC, should generally be the same as the expected\n" +
         "compression buffer size, or next lowest power of 2. Must be a power of 2."),
-    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "16Mb", new SizeValidator(),
+    LLAP_ALLOCATOR_MAX_ALLOC("hive.llap.io.allocator.alloc.max", "4Mb", new SizeValidator(),

Review comment:
       You say this is changed to be compatible with the ORC setting. I do not understand why this is necessary or what its impact is. This does not look like a change to be taken lightly.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/LlapRecordReaderUtils.java
##########
@@ -0,0 +1,438 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.llap.io.encoded;
+
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.hadoop.hive.ql.io.orc.encoded.LlapDataReader;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.CompressionKind;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.BufferChunk;
+import org.apache.orc.impl.DataReaderProperties;
+import org.apache.orc.impl.DirectDecompressionCodec;
+import org.apache.orc.impl.HadoopShims;
+import org.apache.orc.impl.HadoopShimsFactory;
+import org.apache.orc.impl.InStream;
+import org.apache.orc.impl.OrcCodecPool;
+import org.apache.orc.impl.OrcIndex;
+import org.apache.orc.impl.RecordReaderUtils;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.function.Supplier;
+
+public class LlapRecordReaderUtils {
+
+  private static final HadoopShims SHIMS = HadoopShimsFactory.get();
+  private static final Logger LOG = LoggerFactory.getLogger(LlapRecordReaderUtils.class);
+
+  static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file, CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
+    return codec != null && (!(codec instanceof DirectDecompressionCodec) || !((DirectDecompressionCodec)codec).isAvailable()) ? null : SHIMS.getZeroCopyReader(file, pool);

Review comment:
       Can you invert this condition? There are a lot of negatives, which makes it hard to understand.
   `codec == null || (codec instanceof DirectDecompressionCodec && ((DirectDecompressionCodec) codec).isAvailable()) ? SHIMS.getZeroCopyReader(file, pool) : null` is much more understandable.
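   For illustration, a minimal sketch of the same inversion with a named intermediate, built only from the code quoted above (the `zeroCopyPossible` name is hypothetical):

```java
static HadoopShims.ZeroCopyReaderShim createZeroCopyShim(FSDataInputStream file,
    CompressionCodec codec, RecordReaderUtils.ByteBufferAllocatorPool pool) throws IOException {
  // Zero-copy reading works either with no compression at all, or with a codec
  // that supports direct decompression and is available on this platform.
  boolean zeroCopyPossible = codec == null
      || (codec instanceof DirectDecompressionCodec
          && ((DirectDecompressionCodec) codec).isAvailable());
  return zeroCopyPossible ? SHIMS.getZeroCopyReader(file, pool) : null;
}
```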

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -443,7 +444,8 @@ public void setBaseAndInnerReader(
       return new OrcRawRecordMerger.KeyInterval(null, null);
     }
 
-    OrcTail orcTail = getOrcTail(orcSplit.getPath(), conf, cacheTag, orcSplit.getFileKey()).orcTail;
+    VectorizedOrcAcidRowBatchReader.ReaderData orcReaderData =

Review comment:
       I am not sure about this. Previously we did not create the full reader; why do we need to create it now? All calls from here use orcTail anyway, except `List<StripeStatistics> stats = orcReaderData.reader.getVariantStripeStatistics(null);`, and that can be constructed from the info in the tail as well.

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),

Review comment:
       Please extract "stripe stats" to a constant.
   I understand the correct way to get the stripe stats is via OrcReader.getStripeStats. However, as far as I understand, we do this bit here because we want to avoid creating a reader when we already have the metadata in cache. It would be better to extract this part into a helper method with some javadoc explaining why we do it.
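   For illustration, a minimal sketch of what that helper could look like, built only from the calls already visible in this hunk (the `STRIPE_STATS_STREAM` and `parseStripeStatsFromTail` names are hypothetical):

```java
private static final String STRIPE_STATS_STREAM = "stripe stats";

/**
 * Parses stripe-level statistics directly out of the cached tail buffer, so that a
 * metadata cache hit does not have to construct a full ORC Reader just for the stats.
 */
private static List<OrcProto.StripeStatistics> parseStripeStatsFromTail(OrcTail orcTail)
    throws IOException {
  InStream.StreamOptions options = null;
  CompressionKind compressionKind = orcTail.getCompressionKind();
  if (compressionKind != CompressionKind.NONE) {
    options = InStream.options()
        .withCodec(OrcCodecPool.getCodec(compressionKind))
        .withBufferSize(orcTail.getCompressionBufferSize());
  }
  InStream stream = InStream.create(STRIPE_STATS_STREAM, orcTail.getTailBuffer(),
      orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
  return OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream))
      .getStripeStatsList();
}
```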

##########
File path: llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java
##########
@@ -631,10 +630,19 @@ private OrcFileMetadata getFileFooterFromCacheOrDisk() throws IOException {
           OrcTail orcTail = getOrcTailFromLlapBuffers(tailBuffers);
           counters.incrCounter(LlapIOCounters.METADATA_CACHE_HIT);
           FileTail tail = orcTail.getFileTail();
-          stats = orcTail.getStripeStatisticsProto();
+          CompressionKind compressionKind = orcTail.getCompressionKind();
+          InStream.StreamOptions options = null;
+          if (compressionKind != CompressionKind.NONE) {
+            options = InStream.options()
+                .withCodec(OrcCodecPool.getCodec(compressionKind)).withBufferSize(orcTail.getCompressionBufferSize());
+          }
+          InStream stream = InStream.create("stripe stats", orcTail.getTailBuffer(),
+              orcTail.getMetadataOffset(), orcTail.getMetadataSize(), options);
+          stats = OrcProto.Metadata.parseFrom(InStream.createCodedInputStream(stream)).getStripeStatsList();
           stripes = new ArrayList<>(tail.getFooter().getStripesCount());
+          int stripeIdx = 0;
           for (OrcProto.StripeInformation stripeProto : tail.getFooter().getStripesList()) {
-            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto));
+            stripes.add(new ReaderImpl.StripeInformationImpl(stripeProto, stripeIdx++, -1, null));

Review comment:
       I am confused by this part. Are the previous stripe id and the encryption keys never relevant?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/LocalCache.java
##########
@@ -82,8 +82,7 @@ public void put(Path path, OrcTail tail) {
     if (bb.capacity() != bb.remaining()) {
       throw new RuntimeException("Bytebuffer allocated for path: " + path + " has remaining: " + bb.remaining() + " != capacity: " + bb.capacity());
-    cache.put(path, new TailAndFileData(tail.getFileTail().getFileLength(),
-        tail.getFileModificationTime(), bb.duplicate()));
+    cache.put(path, new TailAndFileData(bb.limit(), tail.getFileModificationTime(), bb.duplicate()));

Review comment:
       getFileLength() is still available. Why this change?

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java
##########
@@ -2003,6 +2005,22 @@ private static IntegerColumnStatistics deserializeIntColumnStatistics(List<OrcPr
    * @param colStats The statistics array
    * @return The min record key
    */
+  private static OrcRawRecordMerger.KeyInterval getKeyInterval(ColumnStatistics[] colStats) {

Review comment:
       This and `getKeyInterval(List<OrcProto.ColumnStatistics> colStats)` are 
almost the same. Can you get rid of this duplication?
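   For illustration, one possible shape for the deduplication, assuming `ColumnStatisticsImpl.deserialize` (with a null schema) is applicable to the column types involved here: the proto-based overload just converts and delegates, so the interval logic lives in one place.

```java
private static OrcRawRecordMerger.KeyInterval getKeyInterval(
    List<OrcProto.ColumnStatistics> protoStats) {
  // Convert the proto statistics to in-memory ColumnStatistics once, then
  // delegate to the array-based overload. Assumes deserialize(null, ...) is
  // acceptable for these column types.
  ColumnStatistics[] colStats = new ColumnStatistics[protoStats.size()];
  for (int i = 0; i < protoStats.size(); i++) {
    colStats[i] = ColumnStatisticsImpl.deserialize(null, protoStats.get(i));
  }
  return getKeyInterval(colStats);
}
```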

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/EncodedTreeReaderFactory.java
##########
@@ -2585,6 +2590,7 @@ private static TreeReader getPrimitiveTreeReader(final int columnIndex,
             .setColumnEncoding(columnEncoding)
             .setVectors(vectors)
             .setContext(context)
+            .setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)

Review comment:
       Isn't this always `false`? Don't we need another case for 
TIMESTAMP_INSTANT?
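   For illustration, a sketch of one way to address this, assuming the builder chain sits inside a `switch` on the column category (the surrounding structure and the `TimestampStreamReader` builder shape are assumptions, not shown in the hunk): let TIMESTAMP_INSTANT share the TIMESTAMP case so the check can actually evaluate to `true`:

```java
switch (columnType.getCategory()) {
  // ... other primitive categories elided ...
  case TIMESTAMP:
  case TIMESTAMP_INSTANT:  // without this label, setIsInstant below is always false
    return TimestampStreamReader.builder()
        .setColumnEncoding(columnEncoding)
        .setVectors(vectors)
        .setContext(context)
        .setIsInstant(columnType.getCategory() == TypeDescription.Category.TIMESTAMP_INSTANT)
        .build();
  // ...
}
```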

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/orc/encoded/LlapDataReader.java
##########
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hive.ql.io.orc.encoded;
+
+import org.apache.hadoop.hive.common.io.DiskRangeList;
+import org.apache.orc.CompressionCodec;
+import org.apache.orc.OrcFile;
+import org.apache.orc.OrcProto;
+import org.apache.orc.StripeInformation;
+import org.apache.orc.TypeDescription;
+import org.apache.orc.impl.OrcIndex;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+
+/** An abstract data reader that IO formats can use to read bytes from underlying storage. */
+public interface LlapDataReader extends AutoCloseable, Cloneable {
+
+  /** Opens the DataReader, making it ready to use. */
+  void open() throws IOException;
+
+  OrcIndex readRowIndex(StripeInformation stripe,
+      TypeDescription fileSchema,
+      OrcProto.StripeFooter footer,
+      boolean ignoreNonUtf8BloomFilter,
+      boolean[] included,
+      OrcProto.RowIndex[] indexes,
+      boolean[] sargColumns,
+      OrcFile.WriterVersion version,
+      OrcProto.Stream.Kind[] bloomFilterKinds,
+      OrcProto.BloomFilterIndex[] bloomFilterIndices
+  ) throws IOException;
+
+  OrcProto.StripeFooter readStripeFooter(StripeInformation stripe) throws IOException;
+
+  /** Reads the data.
+   *
+   * Note that for cases such as zero-copy reads, the caller must release the disk ranges
+   * produced after being done with them. Call isTrackingDiskRanges to find out if this is needed.
+   * @param range List of disk ranges to read. Ranges with data will be ignored.
+   * @param baseOffset Base offset from the start of the file of the ranges in the disk range list.
+   * @param doForceDirect Whether the data should be read into direct buffers.
+   * @return New or modified list of DiskRange-s, where all the ranges are filled with data.
+   */
+  DiskRangeList readFileData(
+      DiskRangeList range, long baseOffset, boolean doForceDirect) throws IOException;
+
+
+  /**
+   * Whether the user should release buffers created by readFileData. See readFileData javadoc.
+   */
+  boolean isTrackingDiskRanges();
+
+  /**
+   * Releases buffers created by readFileData. See readFileData javadoc.
+   * @param toRelease The buffer to release.
+   */
+  void releaseBuffer(ByteBuffer toRelease);
+
+  /**
+   * Clone the entire state of the DataReader with the assumption that the
+   * clone will be closed at a different time. Thus, any file handles in the
+   * implementation need to be cloned.
+   * @return a new instance
+   */
+  LlapDataReader clone();
+
+  @Override
+  void close() throws IOException;
+
+  /**
+   * Returns the compression codec used by this datareader.
+   * We should consider removing this from the interface.
+   * @return the compression codec
+   */
+  CompressionCodec getCompressionCodec();

Review comment:
       This interface looks like a copy of ORC's DataReader except for this method: ORC's DataReader returns a StreamOptions instead of a CompressionCodec. As far as I understand, StreamOptions includes the compression codec plus more info. Moreover, other parts of the code already make use of StreamOptions, so it should be possible to integrate this part of the code as well. I do not understand why the LlapDataReader interface is necessary.
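   For illustration, a minimal sketch of how a caller that only needs the codec could obtain it through ORC's DataReader, assuming the ORC version in use exposes `DataReader#getCompressionOptions()` and `StreamOptions#getCodec()` (the `CodecFromStreamOptions` wrapper is hypothetical):

```java
import org.apache.orc.CompressionCodec;
import org.apache.orc.DataReader;
import org.apache.orc.impl.InStream;

final class CodecFromStreamOptions {
  // Unwraps the codec from the StreamOptions bundle; returns null when the
  // reader was opened without compression.
  static CompressionCodec getCodec(DataReader reader) {
    InStream.StreamOptions options = reader.getCompressionOptions();
    return options == null ? null : options.getCodec();
  }
}
```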

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/sarg/ConvertAstToSearchArg.java
##########
@@ -68,6 +70,10 @@
 
   private static final int KRYO_OUTPUT_BUFFER_SIZE = 4 * 1024;
   private static final int KRYO_OUTPUT_BUFFER_MAX_SIZE = 10 * 1024 * 1024;
+  private static final GregorianCalendar PROLEPTIC = new GregorianCalendar();

Review comment:
       This is not used anywhere, is it?



