[jira] [Commented] (SAMZA-310) Publish container logs to a SystemStream

Yan Fang (JIRA) Thu, 25 Sep 2014 19:15:07 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148628#comment-14148628
 ]


Yan Fang commented on SAMZA-310:
--------------------------------

1) aha, after exploring the code, I find actually the relation is 
*containerId:taskId:taskName = 1:1:N*. containerId == taskId. See the log below

{code}
 2014-09-25 18:58:24 SamzaAppMasterTaskManager [INFO] Claimed SSP taskNames 
Map(Partition 0 -> Set(SystemStreamPartition [kafka, wikipedia-raw, 0], 
SystemStreamPartition [kafka, wikipedia-raw2, 0]), Partition 1 -> 
Set(SystemStreamPartition [kafka, wikipedia-raw2, 1])) for container ID 0
 2014-09-25 18:58:24 SamzaAppMasterTaskManager [INFO] Task ID 0 using command 
bin/run-container.sh
 2014-09-25 18:58:24 SamzaAppMasterTaskManager [INFO] Task ID 0 using env Map(
{code}

I think one of the confusion comes from the YarnConfig class where 
{code}
val TASK_COUNT = "yarn.container.count"
{code}

And then in SamzaAppMasterTaskManager.scala, we use task Id as the container Id.

{code}
 info("Claimed SSP taskNames %s for container ID %s" format (sspTaskNames, 
taskId))
{code}

2) Another confusion is that, our container Id (0,1,2) is different from YARN's 
containerId, which is something like _container_1411696645653_0001_01_000002_, 
though they share the same name...

I think those two should be cleaned-up in SAMZA-348. This is really confusing. 
:)

3) After a few testing, I realize that whenever the AM restarts a new 
container, the taskId remains the same (this totally makes sense, though 
actually i do not know how yarn does this...) . So I think we can use the 
taskId (one container only has one taskId) as the partition key for the logs, 
instead of the YARN's containerId, which is not easy to map to certain 
partition.



> Publish container logs to a SystemStream
> ----------------------------------------
>
>                 Key: SAMZA-310
>                 URL: https://issues.apache.org/jira/browse/SAMZA-310
>             Project: Samza
>          Issue Type: New Feature
>          Components: container
>    Affects Versions: 0.7.0
>            Reporter: Martin Kleppmann
>            Assignee: Yan Fang
>
> At the moment, it's a bit awkward to get to a Samza job's logs: assuming 
> you're running on YARN, you have to navigate around the YARN web interface, 
> and you can only see one container's logs at a time.
> Given that Samza is all about streams, it would make sense for the logs 
> generated by Samza jobs to also be sent to a stream. There, they could be 
> indexed with [Kibana|http://www.elasticsearch.org/overview/kibana/], consumed 
> by an exception-tracking system, etc.
> Notes:
> - The serde for encoding logs into a suitable wire format should be 
> pluggable. There can be a default implementation that uses JSON, analogous to 
> MetricsSnapshotSerdeFactory for metrics, but organisations that already have 
> a standardised in-house encoding for logs should be able to use it.
> - Should this be at the level of Slf4j or Log4j? Currently the log 
> configuration for YARN jobs uses Log4j, which has the advantage that any 
> frameworks/libraries that use Log4j but not Slf4j appear in the logs. 
> However, Samza itself currently only depends on Slf4j. If we tie this feature 
> to Log4j, it would somewhat defeat the purpose of using Slf4j.
> - Do we need to consider partitioning? Perhaps we can use the container name 
> as partitioning key, so that the ordering of logs from each container is 
> preserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SAMZA-310) Publish container logs to a SystemStream

Reply via email to