Chris, thanks — I’ve created a JIRA (SAMZA-342) to track this.

I’ve not really enabled metrics before on any of my jobs, so I’m sure I’m doing 
the right thing here.  To test, I’m running my job locally and I added the 
following to the job config file:

metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory
metrics.reporters=jmx

Running on trace level debugging, I do see a bunch of logs from JmxReporter on 
startup, but then nothing of the form “messages-read”.  I connected to the 
local task with jconsole, but I’m not seeing any obvious place to view counter 
output.

Should I be seeing this stuff in the logs?  Do I need to be running this as a 
YARN task rather than as a local job?  Thanks much!
—T


On Jul 16, 2014, at 9:35 AM, Chris Riccomini <[email protected]> 
wrote:

> Hey TJ,
> 
> To debug this, we're going to have to get our hands dirty with some
> metrics. Sorry. :(
> 
> The first step is to establish whether the BrokerProxy code in Samza is
> actively getting data for the realtime topic when you're processing batch
> messages. To do this, please turn on the JMX reporter for your job and
> look at this metric:
> 
>  %s-%s-messages-read" format (tp.topic, tp.partition))
> 
> 
> 
> Even if your job is in the middle of processing batch data, this metric
> should increase when messages are sent to the realtime topic. If it
> increases, then the message is received in realtime, but being buffered
> somewhere in the SamzaContainer before being handed off to your
> StreamTask. If the number does not increase, then the BrokerProxy either
> isn't fetching the realtime data, or the realtime data is being buffered
> somewhere upstream (on the producer, or in the broker).
> 
> Once we bisect here, we can continue narrowing down where the problem is.
> 
> Cheers,
> Chris
> 
> On 7/15/14 12:50 PM, "TJ Giuli" <[email protected]> wrote:
> 
>> Hey, Chris:
>> 1.  Yes, batch size set to 1: systems.kafka.producer.batch.num.messages=1
>> 2.   Building default chooser with: useBatching=false,
>> useBootstrapping=false, usePriority=true
>> 3.  I just re-ran my test and the delay is about 16 seconds.  When I’m
>> really pumping data through the batch Kafka topics, I’ve seen around 1
>> minute.
>> 
>> Thanks!
>> —T
>> 
>> On Jul 15, 2014, at 10:11 AM, Chris Riccomini
>> <[email protected]> wrote:
>> 
>>> Hey TJ,
>>> 
>>> This sounds like a bug.
>>> 
>>> 1. Is the task.consumer.batch.size config set for your job?
>>> 2. What does this log line say: Building default chooser with:
>>> useBatching=%s, useBootstrapping=%s, usePriority=%s
>>> 3. How long is the delay that you're seeing?
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> On 7/14/14 4:57 PM, "Yan Fang" <[email protected]> wrote:
>>> 
>>>> It seems for me that, the batch processing fetches many messages at one
>>>> time and then takes too long time to process. My first thought is that,
>>>> since we have the systems.system-name.samza.fetch.threshold
>>>> 
>>>> <http://samza.incubator.apache.org/learn/documentation/0.7.0/jobs/config
>>>> ur
>>>> ation-table.html>
>>>> ,
>>>> setting this number to a smaller (default is 50000) will force the
>>>> system
>>>> to fetch the messages more frequently.
>>>> 
>>>> Any other ideas?
>>>> 
>>>> Fang, Yan
>>>> [email protected]
>>>> +1 (206) 849-4108
>>>> 
>>>> 
>>>> On Mon, Jul 14, 2014 at 1:12 PM, TJ Giuli <[email protected]>
>>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have a stream processor that takes inputs from multiple streams,
>>>>> some
>>>>> are more batch, non-latency sensitive and others are real-time,
>>>>> infrequently have traffic and should be low-latency.  The real-time
>>>>> stream
>>>>> helps me interpret the batch stream, so I would ideally like any
>>>>> real-time
>>>>> stream envelopes delivered within some maximum latency from the time
>>>>> the
>>>>> message enters into a Kafka topic.
>>>>> 
>>>>> I have my stream processor configured to prioritize my real-time
>>>>> streams
>>>>> over the batch streams, but I consistently find that the real-time
>>>>> stream
>>>>> is delayed by traffic from the batch stream.  From tracing the Kafka
>>>>> consumer, it looks like my stream processor periodically fetches from
>>>>> Kafka, finds that the batch streams have a large chunk of messages
>>>>> waiting,
>>>>> doesn¹t find anything on the real-time topics, and processes away the
>>>>> batch
>>>>> messages for a few minutes.  During the batch processing, the Kafka
>>>>> consumer does not poll the real-time streams, so if a message is sent
>>>>> to a
>>>>> real-time topic, the message effectively doesn¹t arrive until the next
>>>>> time
>>>>> the Kafka consumer does another fetch.  When a real-time message is
>>>>> consumed by the Kafka consumer, the TieredPriorityChooser correctly
>>>>> prioritizes traffic from the real-time streams over the batch streams.
>>>>> 
>>>>> Does anyone have recommendations on how to get infrequent but
>>>>> important
>>>>> messages within some maximum time bound?  Thanks!
>>>>> ‹T
>>> 
>> 
> 

Reply via email to