xinqiu.hu created HADOOP-18395:
----------------------------------

             Summary: Performance improvement in org.apache.hadoop.io.Text.find()
                 Key: HADOOP-18395
                 URL: https://issues.apache.org/jira/browse/HADOOP-18395
             Project: Hadoop Common
          Issue Type: Improvement
          Components: io
            Reporter: xinqiu.hu

When a candidate match is in progress and src is exhausted while tgt still has bytes remaining, the current implementation resets src and tgt to their marks and continues searching. This is probably unnecessary.

{code:java}
public int find(String what, int start) {
  try {
    ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
    ByteBuffer tgt = encode(what);
    byte b = tgt.get();
    src.position(start);

    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in loop
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position()-1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          if (!(tgt.get() == src.get())) {
            tgt.reset();
            src.reset();
            found = false;
            break; // no match
          }
        }
        if (found) return pos;
      }
    }
    return -1; // not found
  } catch (CharacterCodingException e) {
    throw new RuntimeException("Should not have happened", e);
  }
}
{code}

For example, in the test below, when the search reaches q in the target, src has no bytes remaining, so src is reset to d and the search continues. But from that point on, the remaining length of src is always smaller than the length of tgt, so no later position can match and we can return -1 directly.

{code:java}
@Test
public void testFind() throws Exception {
  Text text = new Text("abcd\u20acbdcd\u20ac");
  assertThat(text.find("cd\u20acq")).isEqualTo(-1);
}
{code}

Perhaps it could be:

{code:java}
public int find(String what, int start) {
  try {
    ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
    ByteBuffer tgt = encode(what);
    byte b = tgt.get();
    src.position(start);

    while (src.hasRemaining()) {
      if (b == src.get()) { // matching first byte
        src.mark(); // save position in loop
        tgt.mark(); // save position in target
        boolean found = true;
        int pos = src.position()-1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) { // src expired first
            return -1;
          }
          if (!(tgt.get() == src.get())) {
            tgt.reset();
            src.reset();
            found = false;
            break; // no match
          }
        }
        if (found) return pos;
      }
    }
    return -1; // not found
  } catch (CharacterCodingException e) {
    throw new RuntimeException("Should not have happened", e);
  }
}
{code}
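
To get a rough feel for the saving, here is a minimal standalone sketch. It is not the Hadoop code itself: TextFindSketch, its find(byte[], byte[], int, boolean) method, and the srcReads counter are hypothetical names introduced only for illustration. The loop mirrors both variants of the search on raw UTF-8 bytes and counts how many bytes are read from src. It assumes a non-empty target.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical standalone class for illustration; not part of Hadoop.
class TextFindSketch {

  // Number of bytes read from src during the most recent find() call.
  static int srcReads;

  // Byte-level search modelled on the loop in Text.find().
  // Assumes what is non-empty. When earlyExit is true, the search returns -1
  // as soon as src runs out in the middle of a candidate match; otherwise it
  // resets both buffers and keeps scanning.
  static int find(byte[] text, byte[] what, int start, boolean earlyExit) {
    srcReads = 0;
    ByteBuffer src = ByteBuffer.wrap(text);
    ByteBuffer tgt = ByteBuffer.wrap(what);
    byte first = tgt.get();
    src.position(start);

    while (src.hasRemaining()) {
      srcReads++;
      if (first == src.get()) {          // matching first byte
        src.mark();                      // position just after the first byte
        tgt.mark();
        boolean found = true;
        int pos = src.position() - 1;
        while (tgt.hasRemaining()) {
          if (!src.hasRemaining()) {     // src expired first
            if (earlyExit) {
              return -1;                 // later positions have even fewer bytes
            }
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
          srcReads++;
          if (tgt.get() != src.get()) {  // no match at this candidate
            tgt.reset();
            src.reset();
            found = false;
            break;
          }
        }
        if (found) {
          return pos;
        }
      }
    }
    return -1;                           // not found
  }

  public static void main(String[] args) {
    byte[] text = "abcd\u20acbdcd\u20ac".getBytes(StandardCharsets.UTF_8);
    byte[] what = "cd\u20acq".getBytes(StandardCharsets.UTF_8);

    int withReset = find(text, what, 0, false);
    int readsWithReset = srcReads;
    int withEarlyExit = find(text, what, 0, true);
    int readsWithEarlyExit = srcReads;

    System.out.println("reset:      result=" + withReset + " reads=" + readsWithReset);
    System.out.println("early exit: result=" + withEarlyExit + " reads=" + readsWithEarlyExit);
  }
}
```

On the input from testFind(), both variants return -1, and the early-exit variant reads fewer bytes from src because it skips the scan over the tail that follows the second candidate. The saving grows with the length of the tail that would otherwise be rescanned.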