[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836831#comment-13836831 ]

Karthik Kambatla commented on YARN-1390:

Agree with Steve and Alejandro. Copied the gist to YARN-1399.

> Provide a way to capture source of an application to be queried through REST or Java Client APIs
>
> Key: YARN-1390
> URL: https://issues.apache.org/jira/browse/YARN-1390
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: api
> Affects Versions: 2.2.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
>
> In addition to other fields like application-type (added in YARN-563), it is useful to have an applicationSource field to track the source of an application. The application source can be useful in (1) fetching only those applications a user is interested in, (2) potentially adding source-specific optimizations in the future.
> Examples of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop etc.

--
This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836823#comment-13836823 ]

Steve Loughran commented on YARN-1390:

oh and restrict the tag names to stuff that works well in URLs
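One possible reading of "works well in URLs" is that a tag should never need percent-encoding. The sketch below assumes that reading and accepts only the RFC 3986 unreserved characters; the class and method names are hypothetical and are not part of any YARN API.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of YARN. */
final class UrlSafeTags {
    // Assumption: "URL-friendly" means the RFC 3986 unreserved set
    // (letters, digits, '-', '.', '_', '~'), so no percent-encoding is needed.
    private static final Pattern UNRESERVED = Pattern.compile("[A-Za-z0-9._~-]+");

    private UrlSafeTags() {}

    /** Returns true if the tag is non-empty and uses only unreserved characters. */
    static boolean isUrlSafe(String tag) {
        return tag != null && UNRESERVED.matcher(tag).matches();
    }
}
```

With this check, a tag such as `oozie-action-1234` would pass, while `my project/etl` would be rejected because of the space and the slash.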
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836723#comment-13836723 ]

Alejandro Abdelnur commented on YARN-1390:

Agree with Steve, we should limit the length of a tag and number of tags. I'd suggest going hardcoded for now, i.e. 50chars/10tags and going configurable later if the need arises.
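The hardcoded limits suggested here could be enforced at submission time roughly as follows. This is only a sketch: the two constants come from the comment above, while the class name, method name, and the choice to throw `IllegalArgumentException` are assumptions made for illustration.

```java
import java.util.Collection;

/** Hypothetical validator for illustration; not part of YARN. */
final class TagLimits {
    static final int MAX_TAG_LENGTH = 50; // "50chars" from the comment
    static final int MAX_TAGS = 10;       // "10tags" from the comment

    private TagLimits() {}

    /** Throws IllegalArgumentException if the tag set exceeds either limit. */
    static void validate(Collection<String> tags) {
        if (tags.size() > MAX_TAGS) {
            throw new IllegalArgumentException(
                "Too many tags: " + tags.size() + " > " + MAX_TAGS);
        }
        for (String tag : tags) {
            if (tag == null || tag.isEmpty()) {
                throw new IllegalArgumentException("Tags must be non-empty");
            }
            if (tag.length() > MAX_TAG_LENGTH) {
                throw new IllegalArgumentException(
                    "Tag longer than " + MAX_TAG_LENGTH + " characters: " + tag);
            }
        }
    }
}
```

Keeping the limits as named constants would make the later move to configuration a matter of replacing two fields.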
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836410#comment-13836410 ]

Steve Loughran commented on YARN-1390:

# Some limits on tag size are going to be needed, obviously. If AMs can update tag data they can use it as a store of information, which would be convenient and dangerous.
# App metadata is visible to all, so users need to be reminded to limit what they say.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835216#comment-13835216 ]

Karthik Kambatla commented on YARN-1390:

As proposed, let us handle the bulk of this on YARN-1399. Leaving this JIRA open to handle any pending work specific to lineage information.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832138#comment-13832138 ]

Karthik Kambatla commented on YARN-1390:

Doing this by way of tags seems reasonable to me. However, as [~zjshen] mentioned, faster filtering/lookup based on tags requires the additional work of adding a map from tags to apps in the RM, to store the apps corresponding to a particular tag.

I propose we split this up into subtasks, so we can take care of the simple field-adding and filtering first and add the optimizations later. We can continue this discussion on YARN-1399.
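The tag-to-apps map mentioned above amounts to an inverted index. A minimal sketch of the idea follows; application IDs are modelled as plain strings, the class is hypothetical rather than an RM data structure, and it is not thread-safe, which a real RM component would need to address.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical inverted index from tag to application IDs; illustration only. */
final class TagIndex {
    private final Map<String, Set<String>> appsByTag = new HashMap<>();

    /** Registers an application under each of its tags. */
    void add(String appId, Set<String> tags) {
        for (String tag : tags) {
            appsByTag.computeIfAbsent(tag, t -> new HashSet<>()).add(appId);
        }
    }

    /** Removes an application, dropping tags that no longer have any apps. */
    void remove(String appId, Set<String> tags) {
        for (String tag : tags) {
            Set<String> apps = appsByTag.get(tag);
            if (apps != null) {
                apps.remove(appId);
                if (apps.isEmpty()) {
                    appsByTag.remove(tag);
                }
            }
        }
    }

    /** Looks up the apps for one tag without scanning all applications. */
    Set<String> appsWithTag(String tag) {
        Set<String> apps = appsByTag.get(tag);
        return apps == null
            ? Collections.<String>emptySet()
            : Collections.unmodifiableSet(apps);
    }
}
```

The trade-off discussed in the thread is visible here: lookups become a single map access, but every add and remove must keep the index in sync, and the index itself takes memory proportional to the number of (app, tag) pairs.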
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832133#comment-13832133 ]

Zhijie Shen commented on YARN-1390:

bq. I like Zhijie's proposal of tags - it is the most general purpose and implementation-wise is no harder or easier than the other approaches. Shall we just do that now instead of proliferating YARN with more specific concepts?

I'm fine with the plan. The major difference I can think of is the internal storage structure for multiple tags, compared to the single one today, and the search on them. We can discuss the details in YARN-1399.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832084#comment-13832084 ]

Vinod Kumar Vavilapalli commented on YARN-1390:

I like Zhijie's proposal of tags - it is the most general purpose and implementation-wise is no harder or easier than the other approaches. Shall we just do that now instead of proliferating YARN with more specific concepts?

I do see that adding semantic information to some of the tags at YARN level like applicationType etc is useful, but they can be implemented on top of tags.

Looking back, this will also solve some of the problems mentioned at YARN-1055. Oozie in the interim could use a tag and kill all applications of a given workflow till work-preserving RM restart is done.

IMO, we should go directly with tags. Thoughts? If everyone is okay, Zhijie/I can take a stab at it via YARN-1399.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832041#comment-13832041 ]

Karthik Kambatla commented on YARN-1390:

An alternative way to look at this is "application-groups". An appGroup can have multiple applications in it, and users might want to know details about the applications within this group. Users might want to run all apps corresponding to a single "project" under one group. The field itself could be plain text to be matched using {{String#equals}} as opposed to pattern-matching.

[~hitesh], does an appGroup sound reasonable to you? It is the exact same thing, but a group is more general compared to lineage information, and can be used for other purposes. If need be, we can consider adding multiple appGroups to an application in the future.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832039#comment-13832039 ]

Karthik Kambatla commented on YARN-1390:

bq. but I am still not sure if having an applicationLineage field is a good idea.

I agree that it will not be used in YARN or AHS, but I think it is key metadata to be associated with an application. This, IMO, is similar to appName, which is also not directly used by YARN but is basic information corresponding to an app.

bq. If it is an implementation question and we are looking at supporting some form of tags, I believe a key-value map is the best approach. Tags can be represented in the map.

Will be an example of key-value pair in the map? If yes, are the keys predefined or up to the user? Predefined keys won't really help us capture everything users would want to tag the application with.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820723#comment-13820723 ]

Hitesh Shah commented on YARN-1390:

[~kkambatl] I haven't had a chance to discuss this with [~vinodkv] and [~zjshen] in person, but I am still not sure if having an applicationLineage field is a good idea. Firstly, from a YARN point of view, it is just a random string. It is a random text-based attribute attached to an application. Its structure is not defined and there is no single place where it can be understood or made sense of.

If it is an implementation question and we are looking at supporting some form of tags, I believe a key-value map is the best approach. Tags can be represented in the map. Supporting only tags could be done, but supporting attributes with values is messy when using only tags. You can use a prefix-based approach such as "key:value", but then you need to handle escaping the key and value to ensure the correct delimiter is used when interpreting the tag representing a pair.

Also, I don't believe supporting a search on this map should be added on the first pass. It is an expensive operation (we also have to consider the increase in memory footprint if an index is created). Adding such a feature to the RM would have performance repercussions.
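To make the escaping concern around "key:value" tags concrete, here is a minimal sketch of one way such a tag could be encoded and decoded. The delimiter, the backslash escape character, and all names are assumptions chosen for illustration; nothing in the thread specifies them.

```java
/** Hypothetical "key:value" tag codec with escaping; illustration only. */
final class KeyValueTag {
    private static final char DELIM = ':';
    private static final char ESC = '\\';

    private KeyValueTag() {}

    /** Prefixes every delimiter or escape character with the escape character. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == DELIM || c == ESC) {
                sb.append(ESC);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds a single tag string from a key and a value. */
    static String encode(String key, String value) {
        return escape(key) + DELIM + escape(value);
    }

    /** Splits a tag at the first unescaped delimiter; returns {key, value}. */
    static String[] decode(String tag) {
        StringBuilder current = new StringBuilder();
        String key = null;
        boolean escaped = false;
        for (char c : tag.toCharArray()) {
            if (escaped) {
                current.append(c);      // escaped character is taken literally
                escaped = false;
            } else if (c == ESC) {
                escaped = true;
            } else if (c == DELIM && key == null) {
                key = current.toString();
                current.setLength(0);   // start collecting the value
            } else {
                current.append(c);      // later unescaped delimiters are kept as-is
            }
        }
        if (key == null || escaped) {
            throw new IllegalArgumentException("Malformed key:value tag: " + tag);
        }
        return new String[] { key, current.toString() };
    }
}
```

Even this small codec needs a malformed-input path and a rule for stray delimiters, which illustrates why a prefix convention layered on plain tags is described above as messy.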
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820608#comment-13820608 ]

Karthik Kambatla commented on YARN-1390:

Re-purposed MAPREDUCE-5618 to track the complementary MR changes.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820602#comment-13820602 ]

Karthik Kambatla commented on YARN-1390:

bq. In the long term, it is feasible to integrate applicationType and applicationLineage when tags are available, and to be processed uniformly.

Agree. Having these fields separately now and representing them as tags once tags become available seems most reasonable to me.

bq. It seems that another issue will be unchoking the tunnel to pass the lineage information from Oozie to YARN. It should go through MR, right?

Yes. There should be an MR change (as well as other frameworks if they choose to) to allow setting the lineage.

Looks like we have consensus on having an applicationLineage field. In the short-term, should we go ahead and make this a single element or a set?
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820297#comment-13820297 ]

Zhijie Shen commented on YARN-1390:

bq. Just to clarify, are you proposing a new field for Application that would be a key-value map and would be used to store tags, applicationLineage, etc?
bq. In the long term, yes

IMHO, tags are going to be a list instead of a key-value map. It doesn't make sense to have the key. If we define the keys, there will always be cases where a user cannot find a suitable key to associate with their words. If users define the keys, the keys can be anything as well (an infinite domain), so there is no difference between the keys and the values. Moreover, I'm afraid it doesn't make sense to make users note down a tag and also come up with the aspect it belongs to.

It seems we have already gone far beyond solving the problem here. The immediate solution to the problem seems to be adding another field, "applicationLineage" (maybe workflow?), while we must have "applicationType", and it should be the computation framework. In the long term, it is feasible to integrate applicationType and applicationLineage when tags are available, and to be processed uniformly. setApplicationType and setApplicationLineage can be considered the express way to add the special tags with the "ApplicationType:" and "ApplicationLineage:" prefixes respectively.

bq. Further, it would be nice to index the apps by these tags, so we don't have to iterate through all the applications and filter everytime we query the RM.

Agree. Not only for tags and the potential new fields of an application, but also for the existing fields. I've suggested the same thing in YARN-1001. It is obviously not efficient to iterate over all the applications in RMContext to find the desired applications. We may need an index mechanism. I also reopened YARN-925 for the sake of pushing the filters into the implementation of the AHS store, which should have the best knowledge of how to index and search applications. RM by default will hold 1 applications at most, and this may be still acceptable. However, AHS may host 1M finished applications, and it will be crazy to iterate over all of them. Maybe we can resort to Lucene for the index (in memory or in the filesystem). Just thinking out loud.

bq. However, I do agree that enforcing applicationType of a YARN application contains exactly one of {Tez, MAPREDUCE, Storm, Spark}

I think it's good to have some enum values for the common computation frameworks. The benefits are:
1. Indicate what applicationType should be
2. Avoid ambiguous words as much as possible (e.g. "MapReduce", "mapreduce", "Map/Reduce", "MR", ...)
However, we should keep the field open so that users can input an applicationType that is not known to us.

Up till now, we've discussed a lot about how to host the information. Maybe it's better to focus more on the essential problem. It seems that another issue will be unchoking the tunnel to pass the lineage information from Oozie to YARN. It should go through MR, right? If another computation framework is used, that needs to be updated as well, right?
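The idea that setApplicationType and setApplicationLineage could act as shortcuts for adding prefixed tags might look roughly like the sketch below. Only the two prefixes come from the comment; the class, its methods, and the replace-on-set behaviour are assumptions for illustration and do not reflect an actual YARN class.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical application record that stores type and lineage as tags. */
final class TaggedApplication {
    static final String TYPE_PREFIX = "ApplicationType:";
    static final String LINEAGE_PREFIX = "ApplicationLineage:";

    private final Set<String> tags = new LinkedHashSet<>();

    void addTag(String tag) {
        tags.add(tag);
    }

    /** Shortcut: stores the type as a tag with the "ApplicationType:" prefix. */
    void setApplicationType(String type) {
        replacePrefixed(TYPE_PREFIX, type);
    }

    /** Shortcut: stores the lineage as a tag with the "ApplicationLineage:" prefix. */
    void setApplicationLineage(String lineage) {
        replacePrefixed(LINEAGE_PREFIX, lineage);
    }

    String getApplicationType() {
        return findPrefixed(TYPE_PREFIX);
    }

    String getApplicationLineage() {
        return findPrefixed(LINEAGE_PREFIX);
    }

    Set<String> getTags() {
        return Collections.unmodifiableSet(tags);
    }

    // Assumption: each special prefix holds at most one value, so setting replaces.
    private void replacePrefixed(String prefix, String value) {
        tags.removeIf(t -> t.startsWith(prefix));
        tags.add(prefix + value);
    }

    private String findPrefixed(String prefix) {
        for (String t : tags) {
            if (t.startsWith(prefix)) {
                return t.substring(prefix.length());
            }
        }
        return null; // no tag with this prefix has been set
    }
}
```

Under this scheme, typed fields and free-form tags live in one collection and can be filtered uniformly, at the cost of reserving the two prefixes.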
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820063#comment-13820063 ]

Karthik Kambatla commented on YARN-1390:

bq. Just to clarify, are you proposing a new field for Application that would be a key-value map and would be used to store tags, applicationLineage, etc?

In the long term, yes. In the short-term, having a field with a single value should suffice.

bq. Are you assuming the source info is just a simple well defined string such as "Oozie" or would Oozie do something like "Oozie:workflowId=1234" ?

We plan on using the Oozie-action-id; so, it is *not* a well-defined string. Let me explain the use case in detail. In YARN, a node failure can result in the failure of a subset of current AMs. In case of Oozie, if the Oozie launcher-AM fails and the action-AM doesn't, re-spawning the launcher-AM can result in two copies of the action-AM, potentially leading to correctness issues. So, the plan is for the launcher-AM to kill previously running action-AMs (if any) before starting new action-AMs. We need the lineage information to figure out the action-AMs the launcher started.

bq. Also, from an implementation point of view, I would assume this map would not be searchable.

Searchability can be of two types. Which one do you think we should avoid?
# The internal RM data-structures using this "map" to "index" app-data. This would help in serving RM Java/REST API queries faster. This comes with the overhead of maintaining these indices etc. I am not actively thinking about this; just a thought that crossed my mind.
# Allow querying for apps matching a particular tag (or Oozie-Action-Id) via filtering in the RM. While it might be okay to not support this in the first cut, I am afraid this is something we should probably support. Otherwise, the client (Oozie) will end up asking for all the applications (in the time frame) and sifting through them, only to discard the rest.
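The launcher-AM use case above boils down to "find the running applications whose lineage equals my Oozie action ID". The sketch below shows that filter over a plain list, with exact matching in the spirit of `String#equals`. `AppInfo` is a hypothetical stand-in, not the YARN `ApplicationReport` type, and where the filter runs (in the RM or in the client) is exactly the open question in the comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical lineage filter for illustration; not a YARN API. */
final class LineageFilter {

    /** Minimal stand-in for the application details a client would see. */
    static final class AppInfo {
        final String appId;
        final String lineage;   // e.g. an Oozie action ID; may be null
        final boolean running;

        AppInfo(String appId, String lineage, boolean running) {
            this.appId = appId;
            this.lineage = lineage;
            this.running = running;
        }
    }

    private LineageFilter() {}

    /** Returns the running applications whose lineage matches exactly. */
    static List<AppInfo> runningAppsWithLineage(List<AppInfo> apps, String lineage) {
        List<AppInfo> matches = new ArrayList<>();
        for (AppInfo app : apps) {
            if (app.running && lineage.equals(app.lineage)) {
                matches.add(app);
            }
        }
        return matches;
    }
}
```

If this loop runs on the client, every application in the time frame has to be shipped over the wire first; running the same predicate inside the RM would return only the matches, which is the argument for supporting filtering there.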
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819992#comment-13819992 ]

Hitesh Shah commented on YARN-1390:

Also, from an implementation point of view, I would assume this map would *not* be searchable. Free-form text or even a set of variable key-val pairs are expensive to search. Only defined fields such as applicationType (which would contain only a single value) should be searchable.

bq. Representing applicationType as a set should suffice.

Representing it as a set is fine. However, how do you expect Oozie to pass source info to Pig, which in turn will pass it to MR? Are you assuming the source info is just a simple well-defined string such as "Oozie", or would Oozie do something like "Oozie:workflowId=1234"?

I think lineage is something which YARN does not need to know or understand at the moment. Better to support it via the free-form map instead of introducing a new field which we are not sure how we plan to use/handle/support.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819986#comment-13819986 ]

Hitesh Shah commented on YARN-1390:

Just to clarify, are you proposing a new field for Application that would be a key-value map and would be used to store tags, applicationLineage, etc?
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819800#comment-13819800 ] Karthik Kambatla commented on YARN-1390: [~zjshen], [~hitesh]: thanks again for your inputs. I believe we are all mostly in agreement. In the longer term, I envision having tags and realizing other fields like applicationType and lineage information. Further, it would be nice to index the apps by these tags, so we don't have to iterate through all the applications and filter every time we query the RM. bq. How do you expect someone to search for all mapreduce jobs? Do a substring search? Representing applicationType as a set should suffice. To check if an app is an MR job, one should be able to just do applicationType.contains("MAPREDUCE"). All components - JHS, AHS, Java / REST APIs - should be able to make this small adjustment to continue working the way they do today. However, I do agree that enforcing that the applicationType of a YARN application contains *exactly one* of {Tez, MAPREDUCE, Storm, Spark} might lead to slight, albeit unnecessary, complication. Given that, do we have consensus that we need applicationSource / applicationLineage / tags? If so, what is the preferred name for this new field? [~hitesh], [~zjshen], [~vinodkv] - thoughts?
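The indexing idea in the comment above (looking up applications by tag instead of filtering the full application list on every query) can be sketched roughly as follows. This is only an illustration under stated assumptions: `TagIndex`, `add`, and `lookup` are hypothetical names and are not part of YARN or the ResourceManager.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory index from tag to application IDs; not a YARN class. */
class TagIndex {
    // tag -> IDs of the applications that carry that tag
    private final Map<String, Set<String>> appsByTag = new HashMap<>();

    /** Registers an application under each of its tags. */
    void add(String appId, Set<String> tags) {
        for (String tag : tags) {
            appsByTag.computeIfAbsent(tag, t -> new HashSet<>()).add(appId);
        }
    }

    /** Returns the IDs of applications with the given tag without scanning every application. */
    Set<String> lookup(String tag) {
        Set<String> ids = appsByTag.getOrDefault(tag, Collections.emptySet());
        return Collections.unmodifiableSet(ids);
    }
}
```

With such an index, a query for "MAPREDUCE" touches one map entry; the trade-off is that the index has to be kept in sync whenever an application is added or removed.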
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819416#comment-13819416 ] Hitesh Shah commented on YARN-1390: bq. If we allow setting multiple applicationTypes (or applicationSources) and by default add to the list, this is implicitly addressed. Between having a single applicationType field (with potentially multiple values) to address both requirements or a separate field, there might not be much difference. Do you see any drawbacks of using applicationType for these multiple values? Thanks. Yes - I believe there is quite a bit of difference. How is this list of application types meant to be interpreted, and by whom? Who defines (and enforces) the serialized structure of this list? An applicationType that holds the single application framework type is very well defined and can be used by multiple components within YARN. How do you expect someone to search for all mapreduce jobs? Do a substring search? To continue on what Zhijie said, I think tags should probably be a different attribute on an application, in addition to the applicationType. Rules on tags (such as absence/presence) are not enforceable, but an application should always have an applicationType. Of course, if we are just discussing implementation-level details, applicationType could easily be implemented in a generic way via a notion of a tag or, more appropriately, an attribute with a value.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819150#comment-13819150 ] Zhijie Shen commented on YARN-1390: What I was originally proposing is to upgrade the single *applicationType* to multiple *tags*. Actually, the current single applicationType can be considered a tag: according to the name of this field, users are supposed to fill it with an application type. However, we actually have no restriction on what the application type should be. Users are free to come up with words based on their understanding and requirements. For example, in the case that [~kkambatl] and [~rkanter] have mentioned, the ultimate program that submits the application will be used to identify the application type. On the other hand, users may want to classify applications according to the computation framework, such as mapreduce and tez. MAPREDUCE-5618 may immediately solve the problem that [~kkambatl] and [~rkanter] are encountering, but if the applicationType field is set to the source, we can no longer search the applications according to their computation frameworks. To sum up, the single applicationType allows users to describe applications in only one aspect. In contrast, if we allow multiple tags to describe an application, users can annotate the application with both the source (e.g., pig) and the computation framework (e.g., mapreduce), and even other kinds of information, such as "long-running application" and the tenant name. It will be pretty much like the tag system of online photos/videos/music, which allows users to describe the object with their own words. Otherwise, it is not efficient to add a dedicated field (e.g., applicationSource) every time we come up with a new aspect to describe an application. I'm not sure multiple tags is the way we want to solve this issue, so I filed another JIRA (YARN-1399) to track multiple tags for an application. However, if we'd like to have a dedicated field for each aspect to describe the application, IMHO, it is good to restrict the words we can supply. For example, applicationType must be the name of a computation framework, chosen among mapreduce, tez, storm, etc. Otherwise, we may expect a chaotic application type list: mapreduce, pig, hive, tez. And it should be similar for applicationSource. In conclusion, a dedicated field should behave like a category with predefined enumerated values.
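The two options contrasted in the comment above (a dedicated field restricted to predefined values, versus free-form tags) can be sketched side by side. This is a minimal illustration only: `FrameworkType` and `AppDescriptor` are hypothetical names, the enum members are taken from the frameworks named in the thread, and none of this reflects an actual YARN API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical closed set of computation frameworks (a "category with predefined values"). */
enum FrameworkType { MAPREDUCE, TEZ, STORM, SPARK }

/** Hypothetical descriptor: exactly one framework type plus any number of free-form tags. */
final class AppDescriptor {
    private final FrameworkType type;
    private final Set<String> tags;

    AppDescriptor(FrameworkType type, Set<String> tags) {
        if (type == null) {
            // The dedicated field is mandatory; tags are optional.
            throw new IllegalArgumentException("an application must have a type");
        }
        this.type = type;
        this.tags = Collections.unmodifiableSet(new HashSet<>(tags));
    }

    FrameworkType getType() { return type; }

    boolean hasTag(String tag) { return tags.contains(tag); }
}
```

Under this sketch, a Pig script running on MapReduce would carry the type `MAPREDUCE` and a tag such as "pig", so it can be found by either aspect, while a value like "pig" cannot end up in the type field.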
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819149#comment-13819149 ] Karthik Kambatla commented on YARN-1390: [~hitesh], thanks for your inputs. bq. For lineage, something else should be introduced but it requires each and every layer to cooperate to augment the lineage data. If we allow setting multiple applicationTypes (or applicationSources) and by default add to the list, this is implicitly addressed. Between having a single applicationType field (with potentially multiple values) to address both requirements or a separate field, there might not be much difference. Do you see any drawbacks of using applicationType for these multiple values? Thanks.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818866#comment-13818866 ] Hitesh Shah commented on YARN-1390: To add to the above comment, from a YARN point of view, an application is just an application. It has a defined type which can eventually be used to apply some logic or enforce rules (for example, which history server to redirect to or which history plugin to apply). The same application, i.e., the same framework or same type of application, when used in different contexts should by definition not have different types.
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818863#comment-13818863 ] Hitesh Shah commented on YARN-1390: [~vinodkv] How is the application going to be identified from an application history point of view? There seem to be two different things which are required: lineage, to understand how an application was submitted (this could be multiple levels deep), and a way to identify the application itself. For example, what is the plan for an Oozie job that launches a Pig script that in turn runs multiple mapreduce jobs? I think applicationType as it stands today should not change and should remain hardcoded by MR. For lineage, something else should be introduced, but it requires each and every layer to cooperate to augment the lineage data. I don't think there is a quick fix here. This is something which can be introduced at the Hadoop layer but will need to traverse the whole ecosystem for it to work correctly.
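The lineage requirement described above, where every layer cooperates by extending the lineage it received from its parent, might look roughly like the sketch below. It is purely illustrative: `Lineage`, `root`, and `child` are hypothetical names, and nothing like this is claimed to exist in Hadoop, Oozie, or Pig.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable lineage chain; each submitting layer appends itself. */
final class Lineage {
    private final List<String> layers;

    private Lineage(List<String> layers) {
        this.layers = Collections.unmodifiableList(layers);
    }

    /** Starts a lineage at the outermost submitter. */
    static Lineage root(String submitter) {
        List<String> layers = new ArrayList<>();
        layers.add(submitter);
        return new Lineage(layers);
    }

    /** Returns a new lineage extended by one layer; the parent's lineage is left untouched. */
    Lineage child(String layer) {
        List<String> extended = new ArrayList<>(layers);
        extended.add(layer);
        return new Lineage(extended);
    }

    List<String> layers() { return layers; }

    @Override
    public String toString() { return String.join(" > ", layers); }
}
```

For the Oozie example in the comment, `Lineage.root("oozie").child("pig").child("mapreduce")` records all three levels. The chain is only complete if every layer passes it on, which is the cooperation problem the comment points out.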
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818784#comment-13818784 ] Karthik Kambatla commented on YARN-1390: https://issues.apache.org/jira/browse/MAPREDUCE-5618?focusedCommentId=13818782&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13818782 Setting the applicationType to Oozie, Pig, Hive specific values might lead to incompatible changes. Adding multiple application-types for a single YARN application might be the best way forward. [~vinodkv], thoughts?
[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs
[ https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818751#comment-13818751 ] Karthik Kambatla commented on YARN-1390: [~vinodkv], thanks for the inputs. bq. A Pig query is a pig query and jobs spawned for a Pig query should be of type Pig. Correct. For our immediate use case that [~rkanter] described, this should definitely be enough. Let me create an MR JIRA to make that pluggable. However, what if the Pig query is part of an Oozie workflow? We could set it to the Oozie workflow/action id, and Pig/MR shouldn't override it. But then, if we want to list the Pig jobs being run by a user, we'll end up missing this job.