John Doe created HIVE-18216:
-------------------------------

             Summary: When Text is corrupted, processInput() hangs indefinitely
                 Key: HIVE-18216
                 URL: https://issues.apache.org/jira/browse/HIVE-18216
             Project: Hive
          Issue Type: Bug
    Affects Versions: 2.3.2
            Reporter: John Doe


When the Text is corrupted, the following loop becomes infinite.
This is because in hadoop.io.Text.bytesToCodePoint(), when extraBytesToRead == 
-1, the position in the ByteBuffer is not advanced, and thus 
ByteBuffer.hasRemaining() is always true.
If deletionSet.contains(-1), the loop hits the continue statement on every 
iteration and never terminates.

{code:java}
  private String processInput(Text input) {
    StringBuilder resultBuilder = new StringBuilder();
    // Obtain the byte buffer from the input string so we can traverse it code point by code point
    ByteBuffer inputBytes = ByteBuffer.wrap(input.getBytes(), 0, input.getLength());
    // Traverse the byte buffer containing the input string one code point at a time
    while (inputBytes.hasRemaining()) {
      int inputCodePoint = Text.bytesToCodePoint(inputBytes);
      // If the code point exists in deletion set, no need to emit out anything for this code point.
      // Continue on to the next code point
      if (deletionSet.contains(inputCodePoint)) {
        continue;
      }

      Integer replacementCodePoint = replacementMap.get(inputCodePoint);
      // If a replacement exists for this code point, emit out the replacement and append it to the
      // output string. If no such replacement exists, emit out the original input code point
      char[] charArray = Character.toChars((replacementCodePoint != null) ? replacementCodePoint
          : inputCodePoint);
      resultBuilder.append(charArray);
    }
    String resultString = resultBuilder.toString();
    return resultString;
  }
{code}
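
For illustration only, here is a minimal sketch of one way such a loop could be 
guarded so that it always makes progress. It is not the Hive code and not a 
proposed patch: the class StallGuardSketch, the method decodeOrMinusOne() (a 
simplified stand-in for Text.bytesToCodePoint() that also returns -1 without 
moving the buffer position) and the method decodeSkippingInvalid() are all 
hypothetical names. Skipping a single byte on failure is an assumption about 
the desired behavior; other policies (throwing, emitting U+FFFD) are equally 
plausible.

```java
import java.nio.ByteBuffer;

final class StallGuardSketch {

  // Hypothetical stand-in for Text.bytesToCodePoint(): returns -1 and leaves
  // the buffer position untouched when the lead byte is not a valid start.
  static int decodeOrMinusOne(ByteBuffer bytes) {
    int lead = bytes.get(bytes.position()) & 0xFF; // absolute get: position unchanged
    int extra;
    int cp;
    if (lead < 0x80) {
      extra = 0; cp = lead;
    } else if (lead < 0xC0) {
      return -1;                                   // stray continuation byte
    } else if (lead < 0xE0) {
      extra = 1; cp = lead & 0x1F;
    } else if (lead < 0xF0) {
      extra = 2; cp = lead & 0x0F;
    } else if (lead < 0xF8) {
      extra = 3; cp = lead & 0x07;
    } else {
      return -1;                                   // not a legal UTF-8 lead byte
    }
    if (bytes.remaining() < extra + 1) {
      return -1;                                   // truncated sequence
    }
    bytes.get();                                   // consume the lead byte
    for (int i = 0; i < extra; i++) {
      cp = (cp << 6) | (bytes.get() & 0x3F);
    }
    return cp;
  }

  // Decodes the first `length` bytes, dropping anything that cannot be decoded.
  static String decodeSkippingInvalid(byte[] data, int length) {
    StringBuilder out = new StringBuilder();
    ByteBuffer buf = ByteBuffer.wrap(data, 0, length);
    while (buf.hasRemaining()) {
      int before = buf.position();
      int cp = decodeOrMinusOne(buf);
      if (buf.position() == before) {
        buf.position(before + 1);                  // guard: force progress by one byte
      }
      if (!Character.isValidCodePoint(cp)) {
        continue;                                  // covers -1 and out-of-range values
      }
      out.appendCodePoint(cp);
    }
    return out.toString();
  }
}
```

With this guard, an input such as the bytes 'a', 0x80, 'b' yields "ab" instead 
of spinning on the 0x80 byte.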

Here is the hadoop.io.Text.bytesToCodePoint() function.

{code:java}
  public static int bytesToCodePoint(ByteBuffer bytes) {
    bytes.mark();
    byte b = bytes.get();
    bytes.reset();
    int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
    if (extraBytesToRead < 0) return -1; // trailing byte!
    int ch = 0;

    switch (extraBytesToRead) {
    case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
    case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
    case 0: ch += (bytes.get() & 0xFF);
    }
    ch -= offsetsFromUTF8[extraBytesToRead];

    return ch;
  }
{code}
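
The reason the position does not move can be seen in isolation with plain 
java.nio, without Hadoop on the classpath. The sketch below repeats only the 
mark()/get()/reset() sequence from the top of the function above; the class 
MarkResetPeek and the method peek() are hypothetical names used for 
illustration.

```java
import java.nio.ByteBuffer;

final class MarkResetPeek {

  // Same three statements as the start of bytesToCodePoint():
  // read one byte, then rewind to the marked position.
  static int peek(ByteBuffer bytes) {
    bytes.mark();
    byte b = bytes.get();
    bytes.reset();
    return b & 0xFF;
  }

  public static void main(String[] args) {
    // 0x80 is a UTF-8 continuation byte, so it cannot start a code point.
    ByteBuffer buf = ByteBuffer.wrap(new byte[] {(byte) 0x80, 'a'});
    int first = peek(buf);
    // The byte was read, but position and remaining() are unchanged.
    System.out.println(first + " " + buf.position() + " " + buf.remaining());
    // prints "128 0 2"
  }
}
```

Because an early return of -1 happens right after this sequence, a caller that 
loops on hasRemaining() sees the same byte again on the next iteration.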





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
