[ https://issues.apache.org/jira/browse/NIFI-2851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15582286#comment-15582286 ]
ASF GitHub Bot commented on NIFI-2851:
--------------------------------------

Github user olegz commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/1116#discussion_r83643595

    --- Diff: nifi-commons/nifi-utils/src/main/java/org/apache/nifi/stream/io/util/TextLineDemarcator.java ---
    @@ -0,0 +1,227 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nifi.stream.io.util;
    +
    +import java.io.BufferedReader;
    +import java.io.IOException;
    +import java.io.InputStream;
    +
    +/**
    + * Implementation of demarcator of text lines in the provided
    + * {@link InputStream}. It works similar to the {@link BufferedReader} and its
    + * {@link BufferedReader#readLine()} methods except that it does not create a
    + * String representing the text line and instead returns the offset info for the
    + * computed text line. See {@link #nextOffsetInfo()} and
    + * {@link #nextOffsetInfo(byte[])} for more details.
    + * <p>
    + * This class is NOT thread-safe.
    + * </p>
    + */
    +public class TextLineDemarcator {
    +
    +    private final static int INIT_BUFFER_SIZE = 8192;
    +
    +    private final InputStream is;
    +
    +    private final int initialBufferSize;
    +
    +    private byte[] buffer;
    +
    +    private int index;
    +
    +    private int mark;
    +
    +    private long offset;
    +
    +    private int bufferLength;
    +
    +    /**
    +     * Constructs an instance of demarcator with provided {@link InputStream}
    +     * and default buffer size.
    +     */
    +    public TextLineDemarcator(InputStream is) {
    +        this(is, INIT_BUFFER_SIZE);
    +    }
    +
    +    /**
    +     * Constructs an instance of demarcator with provided {@link InputStream}
    +     * and initial buffer size.
    +     */
    +    public TextLineDemarcator(InputStream is, int initialBufferSize) {
    +        if (is == null) {
    +            throw new IllegalArgumentException("'is' must not be null.");
    +        }
    +        if (initialBufferSize < 1) {
    +            throw new IllegalArgumentException("'initialBufferSize' must be > 0.");
    +        }
    +        this.is = is;
    +        this.initialBufferSize = initialBufferSize;
    +        this.buffer = new byte[initialBufferSize];
    +    }
    +
    +    /**
    +     * Will compute the next <i>offset info</i> for a
    +     * text line (line terminated by either '\r', '\n' or '\r\n').
    +     * <br>
    +     * The <i>offset info</i> computed and returned as <code>long[]</code> consisting of
    +     * 4 elements <code>{startOffset, length, crlfLength, startsWithMatch}</code>.
    +     * <ul>
    +     * <li><i>startOffset</i> - the offset in the overall stream which represents the beginning of the text line</li>
    +     * <li><i>length</i> - length of the text line including CRLF characters</li>
    +     * <li><i>crlfLength</i> - the length of the CRLF. Could be either 1 (if line ends with '\n' or '\r')
    +     * or 2 (if line ends with '\r\n').</li>
    +     * <li><i>startsWithMatch</i> - value is always 1. See {@link #nextOffsetInfo(byte[])} for more info.</li>
    +     * </ul>
    +     *
    +     * @return offset info as <code>long[]</code>
    +     */
    +    public long[] nextOffsetInfo() {
    --- End diff --

    Yes, it would be easier to read, but my performance tests show there is also a price to pay for it, although not a very significant one. Will change.


> Improve performance of SplitText
> --------------------------------
>
>                 Key: NIFI-2851
>                 URL: https://issues.apache.org/jira/browse/NIFI-2851
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Mark Payne
>            Assignee: Oleg Zhurakousky
>             Fix For: 1.1.0
>
>
> SplitText is fairly CPU-intensive and quite slow. A simple flow that splits a
> 1.4 million line text file into 5k line chunks and then splits those 5k line
> chunks into 1 line chunks is only capable of pushing through about 10k lines
> per second. This equates to about 10 MB/sec. JVisualVM shows that the
> majority of the time is spent in the locateSplitPoint() method. Isolating
> this code and inspecting how it works, and using some micro-benchmarking, it
> appears that if we refactor the calls to InputStream.read() to instead read
> into a byte array, we can improve performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
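
For readers following the thread, the two ideas discussed above (reading into a byte array instead of calling InputStream.read() once per byte, and reporting each line as offsets rather than as a String) can be sketched as below. This is only an illustrative sketch and not the TextLineDemarcator code from the pull request: the class name LineOffsetSketch and the method lineOffsets are hypothetical, it returns three elements {startOffset, length, crlfLength} and omits startsWithMatch, and reporting crlfLength as 0 for an unterminated final line is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; NOT the TextLineDemarcator from the pull request. */
class LineOffsetSketch {

    /**
     * Scans the stream using bulk reads and returns one long[] per line:
     * {startOffset, length (including the terminator), crlfLength}.
     * Assumption of this sketch: an unterminated last line gets crlfLength 0.
     */
    static List<long[]> lineOffsets(InputStream in, int bufferSize) throws IOException {
        if (in == null) {
            throw new IllegalArgumentException("'in' must not be null.");
        }
        if (bufferSize < 1) {
            throw new IllegalArgumentException("'bufferSize' must be > 0.");
        }
        List<long[]> result = new ArrayList<>();
        byte[] buffer = new byte[bufferSize];
        long position = 0;         // absolute offset of the byte being examined
        long lineStart = 0;        // absolute offset where the current line began
        boolean pendingCr = false; // previous byte was '\r' and the line is still open

        int read;
        // One bulk read per buffer instead of one read() call per byte.
        while ((read = in.read(buffer, 0, buffer.length)) != -1) {
            for (int i = 0; i < read; i++, position++) {
                byte b = buffer[i];
                if (pendingCr) {
                    pendingCr = false;
                    if (b == '\n') {
                        // "\r\n": the line ends after this byte
                        result.add(new long[] {lineStart, position + 1 - lineStart, 2});
                        lineStart = position + 1;
                        continue;
                    }
                    // lone '\r': the line ended just before this byte
                    result.add(new long[] {lineStart, position - lineStart, 1});
                    lineStart = position;
                }
                if (b == '\n') {
                    result.add(new long[] {lineStart, position + 1 - lineStart, 1});
                    lineStart = position + 1;
                } else if (b == '\r') {
                    // wait for the next byte to decide between '\r' and "\r\n"
                    pendingCr = true;
                }
            }
        }
        if (pendingCr) {
            // stream ended right after a lone '\r'
            result.add(new long[] {lineStart, position - lineStart, 1});
        } else if (position > lineStart) {
            // trailing text without a line terminator
            result.add(new long[] {lineStart, position - lineStart, 0});
        }
        return result;
    }
}
```

Calling LineOffsetSketch.lineOffsets on a ByteArrayInputStream over the ASCII bytes of "ab\ncd\r\nef\rgh" with a buffer size of 4 yields {0, 3, 1}, {3, 4, 2}, {7, 3, 1} and {10, 2, 0}. The pendingCr flag is what lets a '\r\n' pair be recognised even when the two bytes fall into different buffer reads.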