[
https://issues.apache.org/jira/browse/TEZ-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569662#comment-14569662
]
Hitesh Shah commented on TEZ-2485:
----------------------------------
[~pramachandran] I think the assumption here is that quite a few things may
need to change. The main caveat would be that the UI should be able to detect
the version from either the dag entity or the app entity and use it to drive
the logic on how calls are made to ATS. Obviously this requires the
TEZ_DAG_ID and/or TEZ_APPLICATION entity names to remain unchanged.
> Reduce the Resource Load on the Timeline Server
> -----------------------------------------------
>
> Key: TEZ-2485
> URL: https://issues.apache.org/jira/browse/TEZ-2485
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jonathan Eagles
> Attachments: TEZ-2485.REMOVE_TEZ_CONTAINER_ID.1.patch,
> TEZ-2485.SHORTER_ENTITIES.1.patch
>
>
> The disk, network, and memory resources needed by the timeline server are
> many times higher than those needed for the equivalent MapReduce job.
> Based on the storage improvements in YARN-3448, the timeline server may
> support up to 30,000 jobs / 10,000,000 tasks a day.
> While I understand there is community effort on timeline server v2, it
> will be good if Tez can reduce its pressure on the timeline server by
> auditing both the number of events and size of events.
> Here are some observations based on my understanding of the design of
> timeline stores:
> Each timeline entity pushed explodes into many records in the database
> 1 marker record
> 1 domain record
> 1 record per event
> 2 records per related entity
> 2 records per primary filter (2 records per primary filter in
> RollingLevelDBTimelineStore; in leveldb it rewrites the entire entity
> record for each primary filter)
> 1 record per other info
> For example
> Task Attempt Start
> 1 marker
> 1 domain
> 1 task attempt start event
> 1 related entity X 2
> 7 other info entries
> 4 primary filters X 2
> 20 records written in the database for task attempt start
> Task Attempt Finish
> 1 marker
> 1 domain
> 1 task attempt finish event
> 1 related entity X 2
> 5 other info entries
> 5 primary filters X 2
> 20 records written in the database for task attempt finish
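
The record arithmetic quoted above can be checked with a short sketch. This is only an illustration of the counting rules as stated in the description (1 marker, 1 domain, 1 per event, 2 per related entity, 2 per primary filter, 1 per other info entry); the function name `records_per_entity` is hypothetical and is not part of Tez or the timeline server.

```python
def records_per_entity(events, related_entities, primary_filters, other_info):
    """Estimate database records written for one timeline entity put.

    Uses the per-entity counting rules quoted in the issue description;
    this is an assumption-based sketch, not timeline store code.
    """
    marker = 1   # 1 marker record
    domain = 1   # 1 domain record
    return (
        marker
        + domain
        + events                 # 1 record per event
        + 2 * related_entities   # 2 records per related entity
        + 2 * primary_filters    # 2 records per primary filter
        + other_info             # 1 record per other info entry
    )


# Inputs taken from the two examples in the description.
start = records_per_entity(events=1, related_entities=1, primary_filters=4, other_info=7)
finish = records_per_entity(events=1, related_entities=1, primary_filters=5, other_info=5)
print(start, finish)  # 20 20
```

Both examples come out to 20 records, matching the totals given in the description.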
> =====================================================
> QUESTION:
> =====================================================
> Is there any data we are publishing to the timeline server that is not
> in the UI?
> Do we use all the entities (TEZ_CONTAINER_ID for example)?
> Do we use all the primary filters?
> Do we use all the related entities specified?
> Are there any fields we don't use?
> Are there other approaches to consider to reduce entity count/size?
> Is there a way to store the same information in less space?
> ===================
> Key Value Breakdown
> ||Count||Key Size||Value Size||
> |5642512|533690380|745454867|
> Entity Type Breakdown
> ||Type||Count||Key Size||Value Size||
> |TEZ_CONTAINER_ID|843850|86244392|5654341|
> |applicationAttemptId|544|53248|6174|
> |applicationId|544|44412|6174|
> |TEZ_TASK_ATTEMPT_ID|2471393|239523553|373637209|
> |TEZ_APPLICATION|1048|84312|13057630|
> |containerId|362443|37013813|4135845|
> |TEZ_VERTEX_ID|99239|10387114|1559948|
> |TEZ_DAG_ID|5402|387705|2910830|
> |TEZ_TASK_ID|1762211|146210017|344478400|
> |TEZ_APPLICATION_ATTEMPT|95838|13741814|8316|
> Column Breakdown
> ||Column||Count||Key Size||Value Size||
> |primarykeys|1092413|118768299|0|
> |marker|373515|25740507|2988120|
> |events|578196|55148482|1156392|
> |domain|373515|26114022|15314115|
> |reverserelated|587815|73721347|0|
> |otherinfo|2143751|170983893|725996240|
> |related|493307|63213830|0|
> Other Info Key Breakdown
> ||Key||Count||Key Size||Value Size||
> |appSubmitTime|126|11466|1638|
> |vertexName|349|23732|3081|
> |stats|349|21987|142938|
> |applicationId|163|10106|5705|
> |exitStatus|84337|7337319|84559|
> |endTime|288538|22354866|3750994|
> |counters|204201|15474759|646685059|
> |startTime|204201|15678960|2654613|
> |nodeId|106761|8540880|3950157|
> |initTime|512|32325|6656|
> |numKilledTasks|512|35397|517|
> |timeTaken|204201|15678960|1061085|
> |inProgressLogsURL|106761|9715251|11741572|
> |config|126|8820|13037092|
> |scheduledTime|96928|7172672|1260064|
> |dagPlan|163|9128|2074899|
> |completedLogsURL|106761|9608490|22703699|
> |taskAttemptErrorEnum|15808|1485952|331784|
> |initRequestedTime|349|26175|4537|
> |startRequestedTime|349|26524|4537|
> |numFailedTasks|512|35397|512|
> |vertexNameIdMapping|163|11084|16157|
> |numSucceededTasks|512|36933|1054|
> |numKilledTaskAttempts|512|38981|521|
> |status|204201|15066357|2198349|
> |processorClassName|349|26524|18690|
> |numFailedTaskAttempts|512|38981|512|
> |tezVersion|126|9324|14364|
> |numTasks|349|23034|665|
> |successfulAttemptId|96785|7742800|4355325|
> |nodeHttpAddress|106761|9501729|3950157|
> |numCompletedTasks|512|36933|1056|
> |diagnostics|204201|16087362|915925|
> |containerId|106761|9074685|5017767|
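
As a rough aid for reading the breakdown tables above, the sketch below computes what share of the stored value bytes a few of the largest other info keys account for. The byte counts are copied from the tables; the helper name `share` and the choice of which keys to include are illustrative assumptions.

```python
# Value sizes in bytes, copied from the breakdown tables above.
OTHERINFO_VALUE_BYTES = 725996240  # "otherinfo" row, Column Breakdown
TOTAL_VALUE_BYTES = 745454867      # Key Value Breakdown

# A few of the largest entries from the Other Info Key Breakdown.
other_info_value_bytes = {
    "counters": 646685059,
    "completedLogsURL": 22703699,
    "config": 13037092,
    "inProgressLogsURL": 11741572,
    "containerId": 5017767,
}


def share(part, whole):
    """Return part as a fraction of whole."""
    return part / whole


# Print keys from largest to smallest value size.
for key, size in sorted(other_info_value_bytes.items(),
                        key=lambda kv: kv[1], reverse=True):
    print(f"{key:20s} "
          f"{share(size, OTHERINFO_VALUE_BYTES):6.1%} of otherinfo values, "
          f"{share(size, TOTAL_VALUE_BYTES):6.1%} of all values")
```

On these numbers, counters alone make up close to nine tenths of the otherinfo value bytes, which suggests that is where a size audit would have the most effect.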
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)