[jira] [Created] (HADOOP-18395) Performance improvement in org.apache.hadoop.io.Text.find()

xinqiu.hu (Jira) Sun, 07 Aug 2022 00:19:12 -0700

xinqiu.hu created HADOOP-18395:
----------------------------------

             Summary: Performance improvement in org.apache.hadoop.io.Text.find()
                 Key: HADOOP-18395
                 URL: https://issues.apache.org/jira/browse/HADOOP-18395
             Project: Hadoop Common
          Issue Type: Improvement
          Components: io
            Reporter: xinqiu.hu



The current implementation resets src and tgt to their marks and continues 
searching when src runs out while tgt still has bytes remaining. This is 
probably unnecessary.
{code:java}
public int find(String what, int start) {
  try {
    ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
    ByteBuffer tgt = encode(what);
    byte b = tgt.get();
    src.position(start);

    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in loop
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position()-1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          if (!(tgt.get() == src.get())) {
            tgt.reset();
            src.reset();
            found = false;
            break; // no match
          }
        }
        if (found) return pos;
      }
    }
    return -1; // not found
  } catch (CharacterCodingException e) {
    throw new RuntimeException("Should not have happened", e);
  }
} {code}
For example, in the test below, when the search reaches the trailing q, src 
has no bytes remaining, so src is reset to the d after the second c and the 
search continues. But from that point on the remaining length of src is always 
smaller than the length of tgt, so no match is possible and we can return -1 
directly.
{code:java}
@Test
public void testFind() throws Exception {
  Text text = new Text("abcd\u20acbdcd\u20ac");
  assertThat(text.find("cd\u20acq")).isEqualTo(-1);
} {code}
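To make the byte arithmetic behind this example concrete, here is a small standalone sketch. It does not use Hadoop; the class and helper names (FindExampleLayout, utf8, lastIndexOf) are made up for illustration, and it only assumes that find() compares UTF-8 bytes, as the code above does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Hadoop.
class FindExampleLayout {
  /** UTF-8 bytes of s, the representation the search loop compares. */
  static byte[] utf8(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  /** Byte offset of the last occurrence of b in data, or -1 if absent. */
  static int lastIndexOf(byte[] data, byte b) {
    for (int i = data.length - 1; i >= 0; i--) {
      if (data[i] == b) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] src = utf8("abcd\u20acbdcd\u20ac"); // 14 bytes, the euro sign takes 3
    byte[] tgt = utf8("cd\u20acq");            // 6 bytes
    int lastC = lastIndexOf(src, tgt[0]);      // second 'c', at byte offset 9
    int remaining = src.length - lastC;        // 5 bytes from there to the end
    System.out.println("src=" + src.length + " tgt=" + tgt.length
        + " lastC=" + lastC + " remaining=" + remaining);
    // remaining < tgt.length, so src runs out before the 'q' is compared,
    // and every later start offset leaves even fewer bytes.
  }
}
```

Only 5 bytes remain from the second c, while tgt needs 6, so the attempt that starts there exhausts src. Any later start position has even fewer bytes available.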
Perhaps it could be:
{code:java}
public int find(String what, int start) {
  try {
    ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
    ByteBuffer tgt = encode(what);
    byte b = tgt.get();
    src.position(start);

    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in loop
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position()-1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            return -1;
          }
          if (!(tgt.get() == src.get())) {
            tgt.reset();
            src.reset();
            found = false;
            break; // no match
          }
        }
        if (found) return pos;
      }
    }
    return -1; // not found
  } catch (CharacterCodingException e) {
    throw new RuntimeException("Should not have happened", e);
  }
}{code}
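As a rough sanity check that the early return should not change results, the following standalone sketch runs both loops on plain UTF-8 bytes and compares them. It is not the Hadoop code: FindVariantsSketch and the earlyExit flag are hypothetical, String.getBytes stands in for Text's encode(), and the target is assumed to be non-empty.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Text.find(), operating on Strings directly.
class FindVariantsSketch {
  /** earlyExit=false mirrors the current loop, earlyExit=true the proposal. */
  static int find(String text, String what, int start, boolean earlyExit) {
    ByteBuffer src = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
    ByteBuffer tgt = ByteBuffer.wrap(what.getBytes(StandardCharsets.UTF_8));
    byte b = tgt.get(); // assumes what is non-empty
    src.position(start);

    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark();
        tgt.mark();
        boolean found = true;
        int pos = src.position() - 1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            if (earlyExit) {
              return -1; // later starts have even fewer bytes left
            }
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          if (tgt.get() != src.get()) { // no match at this start
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
        }
        if (found) {
          return pos;
        }
      }
    }
    return -1; // not found
  }

  public static void main(String[] args) {
    String text = "abcd\u20acbdcd\u20ac";
    String[] targets = {"cd\u20acq", "cd\u20ac", "bd", "\u20ac", "q", "d\u20acb"};
    for (String t : targets) {
      int current = find(text, t, 0, false);
      int proposed = find(text, t, 0, true);
      System.out.println(t + ": current=" + current + " proposed=" + proposed);
    }
  }
}
```

Both variants agree on these inputs. That fits the reasoning above: once src is exhausted during a match attempt, every later start offset has strictly fewer bytes than tgt requires. A handful of inputs is not a proof, so a randomized comparison would be a reasonable addition to the unit test.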



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
