merrimanr commented on a change in pull request #1330: METRON-1968: Messages are lost when a parser produces multiple messages and batch size is greater than 1 URL: https://github.com/apache/metron/pull/1330#discussion_r259100742
########## File path: metron-platform/metron-common/src/main/java/org/apache/metron/common/writer/BulkMessageWriter.java ########## @@ -32,18 +29,16 @@ void init(Map stormConf, TopologyContext topologyContext, WriterConfiguration config) throws Exception; /** - * Writes the messages to a particular output (e.g. Elasticsearch). Exceptions trigger failure of the entire batch. - * @param sensorType The type of sensor being generating the messages + * Writes the messages to a particular output (e.g. Elasticsearch). A response is returned with successful and failed message ids. + * @param sensorType The type of sensor generating the messages * @param configurations Configurations that should be passed to the writer (e.g. index and - * @param tuples The Tuples that produced the message to be written - * @param messages The message to be written + * @param messages A list of messages to be written. Message ids are used in the response to report successes/failures. * @return A response containing successes and failures within the batch. * @throws Exception If an unrecoverable error is made, an Exception is thrown which should be treated as a full-batch failure (e.g. target system is down). */ BulkWriterResponse write(String sensorType , WriterConfiguration configurations - , Iterable<Tuple> tuples - , List<MESSAGE_T> messages + , List<BulkWriterMessage<MESSAGE_T>> messages Review comment: I experimented with this and it's not as straightforward as it seems. Throughout all these classes it is assumed that messages are batched per sensor. If we move sensorType into each message than each class that consumes a `List<BulkMessage>` now has to handle it differently. For example the `ElasticsearchWriter` class looks up an index for the given sensor type and writes all messages to that index. With this change we would now need to lookup an index for each sensor type. Another example is `BatchTimeoutPolicy`. The list of messages isn't even used because it relies solely on the timeout configured for each sensor type. Do we get the sensor type from the list of messages now? Do we just use the sensor type of the first message or do we need to ensure the sensor type matches for all messages in the list? I feel like this is a significant change and wouldn't provide much benefit. If we do want to simplify these interfaces I would prefer to move the sensor type and list of messages into a wrapper object (`BulkWriterRequest` or something). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services