John Doe created HIVE-18216: ------------------------------- Summary: When Text is corrupted, processInput() hangs indefinitely Key: HIVE-18216 URL: https://issues.apache.org/jira/browse/HIVE-18216 Project: Hive Issue Type: Bug Affects Versions: 2.3.2 Reporter: John Doe
When the Text is corrupted, the following loop become infinite. This is because in hadoop.io.Text.bytesToCodePoint(), when extraBytesToRead == -1, the index in the ByteBuffer is not moved, and thus, ByteBuffer.remaining() is always > 0. And it deletionSet.contains(-1), then this loop become infinite. {code:java} private String processInput(Text input) { StringBuilder resultBuilder = new StringBuilder(); // Obtain the byte buffer from the input string so we can traverse it code point by code point ByteBuffer inputBytes = ByteBuffer.wrap(input.getBytes(), 0, input.getLength()); // Traverse the byte buffer containing the input string one code point at a time while (inputBytes.hasRemaining()) { int inputCodePoint = Text.bytesToCodePoint(inputBytes); // If the code point exists in deletion set, no need to emit out anything for this code point. // Continue on to the next code point if (deletionSet.contains(inputCodePoint)) { continue; } Integer replacementCodePoint = replacementMap.get(inputCodePoint); // If a replacement exists for this code point, emit out the replacement and append it to the // output string. If no such replacement exists, emit out the original input code point char[] charArray = Character.toChars((replacementCodePoint != null) ? replacementCodePoint : inputCodePoint); resultBuilder.append(charArray); } String resultString = resultBuilder.toString(); return resultString; } {code} Here is the hadoop.io.Text.bytesToCodePoint() function. {code:java} public static int bytesToCodePoint(ByteBuffer bytes) { bytes.mark(); byte b = bytes.get(); bytes.reset(); int extraBytesToRead = bytesFromUTF8[(b & 0xFF)]; if (extraBytesToRead < 0) return -1; // trailing byte! int ch = 0; switch (extraBytesToRead) { case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */ case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */ case 3: ch += (bytes.get() & 0xFF); ch <<= 6; case 2: ch += (bytes.get() & 0xFF); ch <<= 6; case 1: ch += (bytes.get() & 0xFF); ch <<= 6; case 0: ch += (bytes.get() & 0xFF); } ch -= offsetsFromUTF8[extraBytesToRead]; return ch; } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)