[
https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172839#comment-14172839
]
Chris Riccomini commented on SAMZA-310:
---------------------------------------
bq. yeah, it's good to have a default topic. I would also like to allow users
to set their own topic name. Maybe they want to publish logs from different
jobs to the same topic name, though it could be rare.
Totally. I think we could keep the config, but I just want it to "do the right
thing" in cases where users don't manually specify a topic.
bq. Do you mean, we pass the environment variables CONTAINER_COUNT,
CONTAINER_ID (TASK_ID)?
That's what I was thinking.
bq. CONTAINER_COUNT seems a little awkward because it has nothing to do with
the "environment", though it's really straightforward.
Agreed. IMO, the best fix for this is to have it be part of what the AM exposes
as part of the HTTP JSON server discussed in SAMZA-348. In the meantime,
though, I think the environment variable is really the only mechanism that we have to
pass data between the AM and its containers.
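For illustration, here is a rough sketch of what the container side could look like if the AM passes these values through the environment. The variable names CONTAINER_ID and CONTAINER_COUNT are the ones floated in this thread; the ContainerEnv class and its methods are hypothetical and not part of Samza.

```java
import java.util.Map;

/** Hypothetical holder for values the AM would pass via environment variables. */
final class ContainerEnv {
    final int containerId;
    final int containerCount;

    ContainerEnv(int containerId, int containerCount) {
        this.containerId = containerId;
        this.containerCount = containerCount;
    }

    /** Builds the holder from an environment-style map, e.g. System.getenv(). */
    static ContainerEnv fromEnv(Map<String, String> env) {
        return new ContainerEnv(
            parseInt(env, "CONTAINER_ID"),
            parseInt(env, "CONTAINER_COUNT"));
    }

    /** Fails fast with a clear message when the AM did not set a variable. */
    private static int parseInt(Map<String, String> env, String key) {
        String value = env.get(key);
        if (value == null) {
            throw new IllegalStateException("Missing environment variable: " + key);
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        // Sample values stand in for what the AM would set on a real container.
        ContainerEnv env = fromEnv(Map.of("CONTAINER_ID", "2", "CONTAINER_COUNT", "4"));
        System.out.println("container " + env.containerId + " of " + env.containerCount);
    }
}
```

In a real container the call would be ContainerEnv.fromEnv(System.getenv()). Taking a Map instead of reading the environment directly keeps the parsing testable without touching the process environment.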
bq. When we use the MDC, we want to make this optional from the performance
perspective, right?
I don't think so. I think it should just be always on. These MDC settings only
need to be set at container start time, right? From then on, it should be
reads. It looks like SLF4J's MDC put/get calls are just [wrapping Log4J
directly|https://github.com/qos-ch/slf4j/blob/fdafef0253050516f64a2af80f57840cbd785a38/slf4j-log4j12/src/main/java/org/slf4j/impl/Log4jMDCAdapter.java],
which is a [put on a thread local
map|http://svn.apache.org/viewvc/logging/log4j/trunk/src/main/java/org/apache/log4j/MDC.java?view=markup].
If we wanted to really optimize this, we could have the Kafka appender grab
the MDC values once, and then never call the MDC again (to avoid thread locals).
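To make the "grab the values once" idea concrete, here is a minimal sketch. It does not use the real SLF4J or Log4J classes: FakeMdc is a stand-in built on a plain ThreadLocal map, and CachingAppender and the "containerName" key are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

final class CachedMdcSketch {

    /** Stand-in for a logging MDC: a per-thread key/value map (not the real MDC class). */
    static final class FakeMdc {
        private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);
        /** Counts lookups so the effect of caching can be observed. */
        static final AtomicInteger READS = new AtomicInteger();

        static void put(String key, String value) {
            CONTEXT.get().put(key, value);
        }

        static String get(String key) {
            READS.incrementAndGet();
            return CONTEXT.get().get(key);
        }
    }

    /** Hypothetical appender that reads the context once and reuses the value. */
    static final class CachingAppender {
        private volatile String containerName; // null until the first successful read

        String format(String message) {
            String name = containerName;
            if (name == null) {
                // First call (or value not set yet): hit the thread-local map.
                name = FakeMdc.get("containerName");
                containerName = name;
            }
            return "[" + name + "] " + message;
        }
    }

    public static void main(String[] args) {
        FakeMdc.put("containerName", "container-0"); // set once at container start
        CachingAppender appender = new CachingAppender();
        System.out.println(appender.format("first"));
        System.out.println(appender.format("second"));
        System.out.println("context reads: " + FakeMdc.READS.get());
    }
}
```

One caveat with this approach: the cached value comes from whichever thread appends first, so it is only safe if the values are the same for every thread in the container.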
Also, as far as performance testing goes, you can run
[TestSamzaContainerPerformance.scala|https://github.com/apache/incubator-samza/blob/master/samza-test/src/test/scala/org/apache/samza/test/performance/TestSamzaContainerPerformance.scala]
if you're worried about performance.
> Publish container logs to a SystemStream
> ----------------------------------------
>
> Key: SAMZA-310
> URL: https://issues.apache.org/jira/browse/SAMZA-310
> Project: Samza
> Issue Type: New Feature
> Components: container
> Affects Versions: 0.7.0
> Reporter: Martin Kleppmann
> Assignee: Yan Fang
> Attachments: SAMZA-310.patch
>
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming
> you're running on YARN, you have to navigate around the YARN web interface,
> and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs
> generated by Samza jobs to also be sent to a stream. There, they could be
> indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed
> by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be
> pluggable. There can be a default implementation that uses JSON, analogous to
> MetricsSnapshotSerdeFactory for metrics, but organisations that already have
> a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log
> configuration for YARN jobs uses Log4j, which has the advantage that any
> frameworks/libraries that use Log4j but not Slf4j appear in the logs.
> However, Samza itself currently only depends on Slf4j. If we tie this feature
> to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name
> as partitioning key, so that the ordering of logs from each container is
> preserved.