[ 
https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172588#comment-14172588
 ] 

Chris Riccomini commented on SAMZA-310:
---------------------------------------

bq. do not use MDC eventually. Instead, set a environment variable called Task 
Id. Use this directly as the key for the logs. That is because, if we use MDC, 
we still need to set the task id as the environment variable to pass it from AM 
to containers. It's better to directly use this environment variable.

I think we should use the MDC. There are three reasons for this:

# Using environment variables everywhere feels like using global variables. 
This is just a matter of personal taste, but it bugs me.
# If we use the MDC, I don't think we need to specify the number of partitions 
and the topic name in log4j.xml (see below).
# If we use the MDC, the ConversionPattern can refer to variables such as job 
name, task ID, etc. in the log lines. This lets you print things like 
%X{jobName} %X{taskId}, which could be useful.
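
To illustrate the third point, here is a minimal sketch of how a per-thread context map can feed %X{...} tokens in a pattern. It is a simplified stand-in written against the Java standard library only, not log4j's actual MDC or PatternLayout; the class name MdcSketch and its methods are hypothetical, and only the %X{key} and %m tokens are handled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for an MDC: a per-thread key/value map plus a tiny
// pattern renderer. This is NOT log4j code; it only mimics the idea.
final class MdcSketch {
    // Each thread sees its own context map, as with a real MDC.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    // Matches %X{key} (group 1 holds the key) or the message token %m.
    private static final Pattern TOKEN = Pattern.compile("%X\\{([^}]+)\\}|%m");

    static void put(String key, String value) {
        CONTEXT.get().put(key, value);
    }

    static void clear() {
        CONTEXT.get().clear();
    }

    // Replaces every %X{key} with the context value (empty string if unset)
    // and every %m with the log message.
    static String render(String pattern, String message) {
        Matcher m = TOKEN.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = (key == null)
                    ? message
                    : CONTEXT.get().getOrDefault(key, "");
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With jobName and taskId set once at container start-up, every line rendered with the pattern "%X{jobName} %X{taskId} %m" carries both values without the logging call sites having to pass them.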

bq. need to specify the number of partitions and topic name in log4j.xml

If we use the MDC, I think the SamzaContainer (and Samza AM) can set the MDC 
with good defaults here. The AM knows the total number of containers 
(config.getTaskCount), and both the AM and SamzaContainer know the job name 
(config.getName). If the AM were to pass the container count (sadly, named task 
count right now) via an environment variable, then the SamzaContainer could set 
all of this via the MDC, and the appender could use the proper partition count 
and have a sane default topic name (e.g. __samza-<job name>-logs, or something).
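
A rough sketch of how those defaults might be derived, using only the Java standard library. The class name, the environment variable name SAMZA_CONTAINER_COUNT, and the fallback of one partition are assumptions made for illustration; none of them are existing Samza names. Keying the partition on the container name follows the partitioning note in the issue description, and using the container count as the partition count is one reading of the paragraph above.

```java
import java.util.Map;

// Hypothetical helper; names here are illustrative, not Samza APIs.
final class LogStreamDefaults {
    // Assumed name for the variable the AM would pass to each container.
    static final String ENV_CONTAINER_COUNT = "SAMZA_CONTAINER_COUNT";

    // Default topic name in the form suggested above: __samza-<job name>-logs.
    static String defaultTopic(String jobName) {
        return "__samza-" + jobName + "-logs";
    }

    // Reads the container count from the given environment map,
    // falling back to 1 when the variable is not set.
    static int containerCount(Map<String, String> env) {
        String raw = env.get(ENV_CONTAINER_COUNT);
        return raw == null ? 1 : Integer.parseInt(raw.trim());
    }

    // Maps a container name to a partition so that all log lines from one
    // container land in the same partition and keep their order.
    static int partitionFor(String containerName, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(containerName.hashCode(), partitionCount);
    }
}
```

A container could call containerCount(System.getenv()) once at start-up, place the results in the MDC, and let the appender read them from there instead of from log4j.xml.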

This whole task/container mix-up is really confusing. I've opened up SAMZA-433 
to track renaming TASK_ID to CONTAINER_ID everywhere and updating YarnConfig 
accordingly.

> Publish container logs to a SystemStream
> ----------------------------------------
>
>                 Key: SAMZA-310
>                 URL: https://issues.apache.org/jira/browse/SAMZA-310
>             Project: Samza
>          Issue Type: New Feature
>          Components: container
>    Affects Versions: 0.7.0
>            Reporter: Martin Kleppmann
>            Assignee: Yan Fang
>         Attachments: SAMZA-310.patch
>
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming 
> you're running on YARN, you have to navigate around the YARN web interface, 
> and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs 
> generated by Samza jobs to also be sent to a stream. There, they could be 
> indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed 
> by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be 
> pluggable. There can be a default implementation that uses JSON, analogous to 
> MetricsSnapshotSerdeFactory for metrics, but organisations that already have 
> a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log 
> configuration for YARN jobs uses Log4j, which has the advantage that any 
> frameworks/libraries that use Log4j but not Slf4j appear in the logs. 
> However, Samza itself currently only depends on Slf4j. If we tie this feature 
> to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name 
> as partitioning key, so that the ordering of logs from each container is 
> preserved.



