[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-12-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836831#comment-13836831
 ] 

Karthik Kambatla commented on YARN-1390:


Agree with Steve and Alejandro. Copied the gist to YARN-1399.

> Provide a way to capture source of an application to be queried through REST 
> or Java Client APIs
> 
>
> Key: YARN-1390
> URL: https://issues.apache.org/jira/browse/YARN-1390
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: api
> Affects Versions: 2.2.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
>
> In addition to other fields like application-type (added in YARN-563), it is 
> useful to have an applicationSource field to track the source of an 
> application. The application source can be useful in (1) fetching only those 
> applications a user is interested in, (2) potentially adding source-specific 
> optimizations in the future. 
> Examples of sources are: User-defined project names, Pig, Hive, Oozie, Sqoop 
> etc.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-12-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836823#comment-13836823
 ] 

Steve Loughran commented on YARN-1390:
--

oh and restrict the tag names to stuff that works well in URLs
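
One way to read "works well in URLs" is to accept only characters that never need percent-encoding. The sketch below assumes the RFC 3986 "unreserved" set; the thread does not name a character set, and the class name UrlSafeTagCheck is made up for illustration.

```java
import java.util.regex.Pattern;

/** Sketch only: the allowed character set is an assumption, not something the thread specifies. */
final class UrlSafeTagCheck {
    // RFC 3986 "unreserved" characters: letters, digits, '-', '.', '_', '~'.
    private static final Pattern UNRESERVED = Pattern.compile("[A-Za-z0-9._~-]+");

    /** Returns true if the tag is non-empty and needs no percent-encoding in a URL. */
    static boolean isUrlSafe(String tag) {
        return tag != null && UNRESERVED.matcher(tag).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUrlSafe("oozie-action_0001"));
        System.out.println(isUrlSafe("my tag/with?query"));
    }
}
```

A stricter or looser set (for example, lowercase only) would work the same way; only the pattern changes.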



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-12-02 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836723#comment-13836723
 ] 

Alejandro Abdelnur commented on YARN-1390:
--

Agree with Steve, we should limit the length of a tag and number of tags. I'd 
suggest going hardcoded for now, i.e. 50chars/10tags and going configurable 
later if the need arises.
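
The hardcoded limits suggested above could look roughly like this. The numbers 50 and 10 come from the comment; the class name TagLimits, the method name, and the choice to reject (rather than truncate) are assumptions made for this sketch.

```java
import java.util.Set;

/** Sketch of hardcoded tag limits; names and the reject-on-violation policy are hypothetical. */
final class TagLimits {
    static final int MAX_TAG_LENGTH = 50; // characters per tag, as suggested in the thread
    static final int MAX_TAGS = 10;       // tags per application, as suggested in the thread

    /** Throws IllegalArgumentException if the tag set violates either limit. */
    static void validate(Set<String> tags) {
        if (tags.size() > MAX_TAGS) {
            throw new IllegalArgumentException(
                "too many tags: " + tags.size() + " > " + MAX_TAGS);
        }
        for (String tag : tags) {
            if (tag.isEmpty() || tag.length() > MAX_TAG_LENGTH) {
                throw new IllegalArgumentException(
                    "tag length must be between 1 and " + MAX_TAG_LENGTH + ": " + tag);
            }
        }
    }
}
```

Going configurable later would mean replacing the two constants with values read from configuration; the check itself would not change.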



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-12-02 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836410#comment-13836410
 ] 

Steve Loughran commented on YARN-1390:
--

# Some limits on tag size are going to be needed, obviously. If AMs can update 
tag data they can use it as a store of information, which would be convenient 
and dangerous.

# app metadata is visible to all so users need to be reminded to limit what 
they say




[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-28 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835216#comment-13835216
 ] 

Karthik Kambatla commented on YARN-1390:


As proposed, let us handle the bulk of this on YARN-1399. Leaving this JIRA 
open to handle any pending work specific to lineage information.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832138#comment-13832138
 ] 

Karthik Kambatla commented on YARN-1390:


Doing this by way of tags seems reasonable to me. However, as [~zjshen] 
mentioned, faster filtering/lookup based on tags requires additional work of 
adding a map (tag -> set of apps) to store the apps corresponding 
to a particular tag in the RM. I propose we split this up into subtasks, so we 
can take care of the simple field-adding and filtering first and add the 
optimizations later. We can continue this discussion on YARN-1399. 
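
A minimal sketch of such a tag-to-apps map, assuming application ids are plain strings. The class TagIndex and its methods are hypothetical and are not RM code; the sketch is also not thread-safe, which a real RM data structure would have to be.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical inverted index: tag -> ids of the applications carrying that tag. */
final class TagIndex {
    private final Map<String, Set<String>> appsByTag = new HashMap<>();

    /** Registers an application under each of its tags. */
    void add(String appId, Set<String> tags) {
        for (String tag : tags) {
            appsByTag.computeIfAbsent(tag, t -> new HashSet<>()).add(appId);
        }
    }

    /** Removes an application from the index, dropping tags that become empty. */
    void remove(String appId, Set<String> tags) {
        for (String tag : tags) {
            Set<String> apps = appsByTag.get(tag);
            if (apps != null) {
                apps.remove(appId);
                if (apps.isEmpty()) {
                    appsByTag.remove(tag);
                }
            }
        }
    }

    /** Looks up one tag without scanning all applications. */
    Set<String> appsWithTag(String tag) {
        Set<String> apps = appsByTag.get(tag);
        return apps == null
            ? Collections.<String>emptySet()
            : Collections.unmodifiableSet(apps);
    }
}
```

The cost is the extra memory for the map and the bookkeeping on every add and remove, which is why it is a candidate for a later subtask rather than the first pass.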



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-25 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832133#comment-13832133
 ] 

Zhijie Shen commented on YARN-1390:
---

bq. I like Zhijie's proposal of tags - it is the most general purpose and 
implementation-wise is no harder or easier than the other approaches. Shall we 
just do that now instead of proliferating YARN with more specific concepts?

I'm fine with the plan. The major difference I can think of is the internal 
storage structure for multiple tags, compared to the single one today, and the 
search over them. We can discuss the details in YARN-1399.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-25 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832084#comment-13832084
 ] 

Vinod Kumar Vavilapalli commented on YARN-1390:
---

I like Zhijie's proposal of tags - it is the most general purpose and 
implementation-wise is no harder or easier than the other approaches. Shall we 
just do that now instead of proliferating YARN with more specific concepts?

I do see that adding semantic information to some of the tags at YARN level 
like applicationType etc is useful, but they can be implemented on top of tags.

Looking back, this will also solve some of the problems mentioned at YARN-1055. 
Oozie in the interim could use a tag and kill all applications of a given 
workflow till work-preserving RM restart is done.

IMO, we should go directly with tags. Thoughts? If everyone is okay, Zhijie/I 
can take a stab at it via YARN-1399.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832041#comment-13832041
 ] 

Karthik Kambatla commented on YARN-1390:


An alternative way to look at this is "application-groups". An appGroup can 
have multiple applications in it, and users might want to know details about 
the applications within this group. Users might want to run all apps 
corresponding to a single "project" under one group. The field itself could be 
plain text to be matched using {{String#equals}} as opposed to pattern-matching.

[~hitesh], does an appGroup sound reasonable to you? It is the exact same 
thing, but a group is more general compared to lineage information, and can be 
used for other purposes. If need be, we can consider adding multiple appGroups 
to an application in the future.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-25 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832039#comment-13832039
 ] 

Karthik Kambatla commented on YARN-1390:


bq. but I am still not sure if having an applicationLineage field is a good 
idea.
I agree that it will not be used in YARN or AHS, but I think it is key metadata to 
be associated with an application. This, IMO, is similar to appName which is 
also not directly used by YARN, but is basic information corresponding to an 
app. 

bq.  If it is an implementation question and we are looking at supporting some 
form of tags, I believe a key-value map is the best approach. Tags can be 
represented in the map.
Will  be an example of key-value pair in the map? If 
yes, are the keys predefined or up to the user? Pre-defined keys won't really 
help us capture everything users would want to tag the application with.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820723#comment-13820723
 ] 

Hitesh Shah commented on YARN-1390:
---

[~kkambatl] I haven't had a chance to discuss this with [~vinodkv] and 
[~zjshen] in person but I am still not sure if having an applicationLineage 
field is a good idea. Firstly, from a YARN point of view, it is just a random 
string. It is a random text-based attribute attached to an application. Its 
structure is not defined and there is no single place where it can be 
understood or made sense of. If it is an implementation question and we are 
looking at supporting some form of tags, I believe a key-value map is the best 
approach. Tags can be represented in the map. Supporting only tags could be 
done but supporting attributes with values is messy when using only tags. You 
can use a prefix based approach such as "key:value" but then you need to handle 
escaping the key and value to ensure the correct delimiter is used when 
interpreting the tag representing a pair.
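
To make the escaping concern concrete, here is a rough sketch of packing a key and a value into a single "key:value" tag. The backslash escape character and the class name KeyValueTag are assumptions for illustration; the thread does not settle on an encoding.

```java
/** Hypothetical "key:value" tag codec; backslash escapes ':' and '\' inside key and value. */
final class KeyValueTag {
    private static final char DELIM = ':';
    private static final char ESC = '\\';

    static String encode(String key, String value) {
        return escape(key) + DELIM + escape(value);
    }

    private static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == DELIM || c == ESC) {
                out.append(ESC); // protect delimiter and escape characters
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Returns {key, value}; throws if the tag has no unescaped delimiter. */
    static String[] decode(String tag) {
        StringBuilder key = new StringBuilder();
        StringBuilder value = new StringBuilder();
        StringBuilder current = key;
        boolean seenDelim = false;
        boolean escaped = false;
        for (char c : tag.toCharArray()) {
            if (escaped) {
                current.append(c);   // escaped character is taken literally
                escaped = false;
            } else if (c == ESC) {
                escaped = true;
            } else if (c == DELIM && !seenDelim) {
                seenDelim = true;    // first unescaped ':' splits key from value
                current = value;
            } else {
                current.append(c);
            }
        }
        if (!seenDelim) {
            throw new IllegalArgumentException("no delimiter in tag: " + tag);
        }
        return new String[] { key.toString(), value.toString() };
    }

    public static void main(String[] args) {
        String tag = encode("workflow", "oozie:1234");
        String[] kv = decode(tag);
        System.out.println(tag + " -> " + kv[0] + " / " + kv[1]);
    }
}
```

Every producer and consumer of such tags has to agree on this codec, which is the messiness the comment refers to; a real key-value map avoids it.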

Also, I don't believe supporting a search on this map should be added on the 
first pass. It is an expensive operation ( also have to consider the increase 
in memory footprint if an index is created ). Adding such a feature to the RM  
would have performance repercussions.




[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820608#comment-13820608
 ] 

Karthik Kambatla commented on YARN-1390:


Re-purposed MAPREDUCE-5618 to track the complementary MR changes.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820602#comment-13820602
 ] 

Karthik Kambatla commented on YARN-1390:


bq. In the long term, it is feasible to integrate applicationType and 
applicationLineage when tags are available, and to be processed uniformly. 
Agree. Having these fields separately now and representing them as tags once 
tags become available seems most reasonable to me.

bq. It seems that another issue will be unchoking the tunnel to pass the 
lineage information from Oozie to YARN. It should go through MR, right?
Yes. There should be an MR change (as well as other frameworks if they choose 
to) to allow setting the lineage.

Looks like we have consensus on having an applicationLineage field. In the 
short-term, should we go ahead and make this a single element or a set? 




[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820297#comment-13820297
 ] 

Zhijie Shen commented on YARN-1390:
---

bq. Just to clarify, are you proposing a new field for Application that would 
be a key-value map and would be used to store tags, applicationLineage, etc?
bq. In the long term, yes

IMHO, tags are going to be a list instead of a key-value map. It doesn't make 
sense to have the key. If we define the keys, there will always be cases where 
a user cannot find a suitable key to associate with their words. If users 
define the keys, the keys can be anything as well (an infinite domain), 
so there's no difference between the keys and the values. Moreover, I'm 
afraid it doesn't make sense to make a user note down a tag and also come up 
with the aspect it belongs to.

It seems we have already gone far beyond solving the problem here. The 
immediate solution to the problem seems to be adding another field, 
"applicationLineage" (maybe workflow?), while we must have "applicationType", 
and it should be the computation framework.

In the long term, it is feasible to integrate applicationType and 
applicationLineage when tags are available, and to be processed uniformly. 
setApplicationType and setApplicationLineage can be considered as the express 
way to add the special tags with "ApplicationType:" and "ApplicationLineage:" 
prefix respectively.
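
A sketch of how the two setters could become shorthand for prefixed tags. The prefixes are the ones named above; the class TaggedSubmission is hypothetical and stands in for whatever submission object would carry the tags.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical submission object where type and lineage are stored as special tags. */
final class TaggedSubmission {
    static final String TYPE_PREFIX = "ApplicationType:";
    static final String LINEAGE_PREFIX = "ApplicationLineage:";

    private final Set<String> tags = new LinkedHashSet<>();

    void addTag(String tag) {
        tags.add(tag);
    }

    /** Express way to set the single type tag. */
    void setApplicationType(String type) {
        replacePrefixed(TYPE_PREFIX, type);
    }

    /** Express way to set the single lineage tag. */
    void setApplicationLineage(String lineage) {
        replacePrefixed(LINEAGE_PREFIX, lineage);
    }

    // Keeps at most one tag per prefix: drop the old one, then add the new one.
    private void replacePrefixed(String prefix, String value) {
        tags.removeIf(t -> t.startsWith(prefix));
        tags.add(prefix + value);
    }

    /** Returns a copy so callers cannot mutate the internal set. */
    Set<String> getTags() {
        return new LinkedHashSet<>(tags);
    }
}
```

With this shape, a filter on applicationType is just a tag lookup for "ApplicationType:" plus the type name, so both fields are processed uniformly.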

bq.  Further, it would be nice to index the apps by these tags, so we don't 
have to iterate through all the applications and filter every time we query the 
RM.

Agree. Not only for tags and the potential new fields of an application, but 
also for the existing fields. I've suggested the same thing in YARN-1001. It is 
obviously not efficient to iterate over all the applications in RMContext to 
find the desired applications. We may need the index mechanism. I also reopened 
YARN-925 for the sake of pushing the filters into the implementation of AHS 
store, which should have the best knowledge of how to index and search 
applications. RM by default will hold 1 applications at most, and this may 
be still acceptable. However, AHS may host 1M finished applications, and it 
will be crazy to iterate over all the applications. Maybe we can resort to 
Lucene for indexing (in memory or on the filesystem). Just thinking out loud.

bq. However, I do agree that enforcing applicationType of a YARN application 
contains exactly one of \{Tez, MAPREDUCE, Storm, Spark\}

I think it's good to have some enum values for the common computation 
frameworks. The benefits are:
1. Indicate what applicationType should be
2. Avoid ambiguous words as much as possible (e.g. "MapReduce", "mapreduce", 
"Map/Reduce", "MR", ...)
However, we should make the field open for users to input the applicationType 
that is not known to us.
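
A rough sketch of "known values plus an open field": common spellings collapse to one canonical name, and anything unrecognized is passed through unchanged. The enum members mirror the frameworks named in this thread; the "MR" alias and the class name AppTypes are assumptions for illustration.

```java
import java.util.Locale;

/** Hypothetical normalizer for the applicationType field. */
final class AppTypes {
    /** Well-known computation frameworks (list taken from the discussion). */
    enum Known { MAPREDUCE, TEZ, STORM, SPARK }

    /** Maps variants like "Map/Reduce" or "mapreduce" to "MAPREDUCE"; unknown types pass through. */
    static String normalize(String raw) {
        String trimmed = raw.trim();
        // Upper-case and strip punctuation so "Map/Reduce" and "map-reduce" compare equal.
        String key = trimmed.toUpperCase(Locale.ROOT).replaceAll("[^A-Z0-9]", "");
        if (key.equals("MR")) {
            key = "MAPREDUCE"; // assumed alias
        }
        for (Known k : Known.values()) {
            if (k.name().equals(key)) {
                return k.name();
            }
        }
        return trimmed; // field stays open for types we do not know about
    }
}
```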

Up till now, we've discussed a lot about how to host the information. Maybe 
it's better to focus more on the essential problem. It seems that another issue 
will be unchoking the tunnel to pass the lineage information from Oozie to 
YARN. It should go through MR, right? If another computation framework is used, 
it needs to be updated as well, right?



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820063#comment-13820063
 ] 

Karthik Kambatla commented on YARN-1390:


bq. Just to clarify, are you proposing a new field for Application that would 
be a key-value map and would be used to store tags, applicationLineage, etc?
In the long term, yes. In the short-term, having a field with a single value 
should suffice.

bq. Are you assuming the source info is just a simple well defined string such 
as "Oozie" or would Oozie do something like "Oozie:workflowId=1234" ?
We plan on using the Oozie-action-id; so, it is *not* a well-defined string. 
Let me explain the use case in detail.

In YARN, a node failure can result in the failure of a subset of current AMs. 
In case of Oozie, if the Oozie launcher-AM fails and the action-AM doesn't, 
re-spawning the launcher-AM can result in two copies of the action-AM 
potentially leading to correctness issues. So, the plan is for the launcher AM 
to kill previously running action-AMs (if any) before starting new action-AMs. 
We need the lineage information to figure out the action-AMs the launcher 
started. 

bq. Also, from an implementation point of view, I would assume this map would 
not be searchable.
Searchability can be of two types. Which one do you think we should avoid? 
# The internal RM data-structures using this "map" to "index" app-data. This 
would help in serving RM java/REST API queries faster. This comes with the 
overhead of maintaining these indices etc. I am not actively thinking about 
this; just a thought that crossed my mind.
# Allow querying for apps matching a particular tag (or Oozie-Action-Id) via 
filtering in the RM. While it might be okay to not support this in the 
first-cut, I am afraid this is something we should probably support. Otherwise, 
the client (Oozie) will end up asking for all the applications (in the time 
frame) and sifting through them, only to discard the rest. 
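
The launcher-AM cleanup described above could be sketched as follows. App and Cluster are hypothetical stand-ins, not YARN client classes. This version filters on the client side, which is exactly the "fetch everything and sift" cost mentioned in the second point; with filtering in the RM, listApps() would take the tag and return only the matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch: before re-spawning actions, kill still-running apps tagged with the same action id. */
final class StaleActionCleanup {
    /** Hypothetical minimal view of an application report. */
    static final class App {
        final String id;
        final Set<String> tags;
        final boolean running;

        App(String id, Set<String> tags, boolean running) {
            this.id = id;
            this.tags = tags;
            this.running = running;
        }
    }

    /** Hypothetical stand-in for whatever client API the RM exposes. */
    interface Cluster {
        List<App> listApps();
        void kill(String appId);
    }

    /** Kills every running app carrying the given tag and returns the ids that were killed. */
    static List<String> killRunningWithTag(Cluster cluster, String tag) {
        List<String> killed = new ArrayList<>();
        for (App app : cluster.listApps()) {   // client-side scan over all apps
            if (app.running && app.tags.contains(tag)) {
                cluster.kill(app.id);
                killed.add(app.id);
            }
        }
        return killed;
    }
}
```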



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819992#comment-13819992
 ] 

Hitesh Shah commented on YARN-1390:
---

Also, from an implementation point of view, I would assume this map would 
*not* be searchable. Free-form text or even a set of variable key-val pairs are 
expensive to search. Only defined fields such as applicationType ( which would 
contain only a single value ) should be searchable. 

bq. Representing applicationType as a set should suffice.

Representing it as a set is fine. However, how do you expect Oozie to pass 
source info to Pig which in turn will pass it to MR ? Are you assuming the 
source info is just a simple well defined string such as "Oozie" or would Oozie 
do something like "Oozie:workflowId=1234" ? I think lineage is something which 
YARN does not need to know or understand at the moment. Better to support it 
via the free-form map instead of introducing a new field which we are not sure 
how we plan to use/handle/support. 





[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-12 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819986#comment-13819986
 ] 

Hitesh Shah commented on YARN-1390:
---

Just to clarify, are you proposing a new field for Application that would be a 
key-value map and would be used to store tags, applicationLineage, etc? 



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819800#comment-13819800
 ] 

Karthik Kambatla commented on YARN-1390:


[~zjshen], [~hitesh]: thanks again for your inputs. I believe we are all mostly 
in agreement. 

In the longer term, I envision having tags and realizing other fields, such as 
applicationType and lineage information, through them. Further, it would be nice 
to index the apps by these tags, so we don't have to iterate through all the 
applications and filter every time we query the RM.
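
To make the indexing idea concrete, here is a minimal sketch of a tag-to-application 
index. TagIndex, add and lookup are hypothetical names used only for 
illustration; nothing here is existing RM code, and a real version would also 
need removal on application completion and thread safety.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not YARN code: maps each tag to the IDs of the
// applications that carry it, so a query is a single map lookup.
class TagIndex {
    private final Map<String, Set<String>> appIdsByTag = new HashMap<>();

    // Registers an application under every one of its tags.
    void add(String appId, Set<String> tags) {
        for (String tag : tags) {
            appIdsByTag.computeIfAbsent(tag, t -> new HashSet<String>()).add(appId);
        }
    }

    // Returns the IDs of all applications with the given tag (empty if none).
    Set<String> lookup(String tag) {
        Set<String> ids = appIdsByTag.get(tag);
        return ids == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(ids);
    }
}
```

The trade-off is extra memory and bookkeeping on submission in exchange for not 
scanning every application on each query.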

bq. How do you expect someone to search for all mapreduce jobs? Do a substring 
search?
Representing applicationType as a set should suffice. To check if an app is an 
MR job, one should be able to just do applicationType.contains("MAPREDUCE"). 
All components - JHS, AHS, Java / REST APIs - should be able to make this small 
adjustment to continue working the way they do today. 
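
A minimal sketch of that check, assuming applicationType were exposed as a 
Set<String>. ApplicationTypeCheck and isMapReduce are hypothetical names, and 
the type values other than MAPREDUCE are made up for the example. Set 
membership is an exact match, so no substring search is involved.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: assumes applicationType is a set of type names.
class ApplicationTypeCheck {
    // True only if the set contains the exact value "MAPREDUCE".
    static boolean isMapReduce(Set<String> applicationType) {
        return applicationType.contains("MAPREDUCE");
    }

    public static void main(String[] args) {
        // A Pig query that runs on MapReduce carries both values.
        Set<String> pigOnMr = new HashSet<String>(Arrays.asList("PIG", "MAPREDUCE"));
        // An application on a different framework carries no MAPREDUCE entry.
        Set<String> tezOnly = new HashSet<String>(Arrays.asList("TEZ"));
        System.out.println(isMapReduce(pigOnMr));
        System.out.println(isMapReduce(tezOnly));
    }
}
```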

However, I do agree that enforcing that the applicationType of a YARN 
application contain *exactly one* of {Tez, MAPREDUCE, Storm, Spark} might lead 
to a slight, albeit unnecessary, complication. Given that, do we have consensus 
that we need applicationSource / applicationLineage / tags? If so, what is the 
preferred name for this new field? [~hitesh], [~zjshen], [~vinodkv] - thoughts?






[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819416#comment-13819416
 ] 

Hitesh Shah commented on YARN-1390:
---

bq. If we allow setting multiple applicationTypes (or applicationSources) and 
by default add to the list, this is implicitly addressed. Between having a 
single applicationType field (with potentially multiple values) to address both 
requirements or a separate field, there might not be much difference. Do you 
see any drawbacks of using applicationType for these multiple values? Thanks.

Yes - I believe there is quite a bit of difference. How is this list of 
application types meant to be interpreted and by whom? Who defines ( and 
enforces ) the serialized structure of this list? ApplicationType supporting 
the single application framework type is very well defined and can be used by 
multiple components within YARN. How do you expect someone to search for all 
mapreduce jobs? Do a substring search? 

To continue on what Zhijie said, I think tags should probably be a different 
attribute on an application in addition to the applicationType. Rules on tags ( 
such as absence/presence) are not enforceable but an application should always 
have an applicationType. Of course, if we are just discussing implementation 
level details, applicationType could easily be implemented in a generic way via 
a notion of a tag or more appropriately an attribute with a value. 
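
The distinction between a mandatory applicationType and optional tags could be 
sketched as follows. TaggedApplication and its accessors are hypothetical names 
for illustration only; this is not the YARN submission API.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model: one required type, any number of optional tags.
final class TaggedApplication {
    private final String applicationType; // mandatory: a single framework type
    private final Set<String> tags;       // optional: may be empty, no rules enforced

    TaggedApplication(String applicationType, Set<String> tags) {
        if (applicationType == null || applicationType.trim().isEmpty()) {
            throw new IllegalArgumentException("applicationType is required");
        }
        this.applicationType = applicationType;
        // Absent tags are treated as an empty set; present tags are copied.
        this.tags = tags == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(new HashSet<String>(tags));
    }

    String getApplicationType() { return applicationType; }

    Set<String> getTags() { return tags; }
}
```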



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819150#comment-13819150
 ] 

Zhijie Shen commented on YARN-1390:
---

What I was originally proposing is to upgrade the single *applicationType* to 
multiple *tags*. Actually, the current single applicationType can be considered 
a tag: according to the name of this field, users are supposed to fill it 
with an application type. However, we actually have no restriction on what 
the application type should be. Users are free to come up with some words based 
on their understanding and requirements. For example, in the case that 
[~kkambatl] and [~rkanter] have mentioned, the ultimate program that submits the 
application will be used to identify the application type. On the other hand, 
users may want to classify applications according to the computation framework, 
such as mapreduce and tez. MAPREDUCE-5618 may immediately solve the problem 
that [~kkambatl] and [~rkanter] are encountering, but if the applicationType 
field is set to the source, we can no longer search the applications according 
to their computation frameworks. To sum up, the single applicationType allows 
users to describe the applications in only one aspect.

In contrast, if we allow multiple tags to describe an application, users can 
annotate the application with both the source (e.g., pig) and the computation 
framework (e.g., mapreduce), and even other kinds of information, such as 
"long-running application" and the tenant name. It will be pretty much like 
the tag systems of online photos/videos/music, which allow users to describe 
an object in their own words. Otherwise, it is not efficient to add a dedicated 
field (e.g., applicationSource) every time we come up with a new aspect to 
describe an application.

I'm not sure multiple tags are the way we want to solve this issue, and I filed 
another JIRA (YARN-1399) to track multiple tags for an application. However, if 
we'd like to have a dedicated field for each aspect of the application, IMHO it 
is good to restrict the words we can supply. For example, applicationType must 
be the name of a computation framework, chosen among mapreduce, tez, storm, 
etc. Otherwise, we may expect a chaotic application type list: mapreduce, pig, 
hive, tez. And it should be similar for applicationSource. In conclusion, a 
dedicated field should behave like a category with predefined enumerated 
values.
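
A category with predefined values could look roughly like the enum below. 
ComputationFramework and parse are hypothetical names, and the three constants 
are only the frameworks named above, not a proposed final list.

```java
import java.util.Locale;

// Hypothetical sketch: applicationType restricted to a fixed list of frameworks.
enum ComputationFramework {
    MAPREDUCE, TEZ, STORM;

    // Parses a user-supplied value case-insensitively. Enum.valueOf throws
    // IllegalArgumentException for anything outside the predefined list,
    // so values like "pig" or "hive" are rejected.
    static ComputationFramework parse(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}
```

The cost of this design is that adding a new framework requires a code change, 
which is the opposite trade-off from free-form tags.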





[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819149#comment-13819149
 ] 

Karthik Kambatla commented on YARN-1390:


[~hitesh], thanks for your inputs.

bq. For lineage, something else should be introduced but it requires each and 
every layer to cooperate to augment the lineage data.
If we allow setting multiple applicationTypes (or applicationSources) and by 
default add to the list, this is implicitly addressed. Between having a single 
applicationType field (with potentially multiple values) to address both 
requirements or a separate field, there might not be much difference. Do you 
see any drawbacks of using applicationType for these multiple values? Thanks.



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818866#comment-13818866
 ] 

Hitesh Shah commented on YARN-1390:
---

To add to the above comment, from a YARN point of view, an application is just 
an application. It has a defined type which can be used eventually to apply 
some logic or enforce rules ( for example, which history server to redirect to 
or which history plugin to apply ). The same application, i.e., the same 
framework or same type of application, when used in different contexts should 
by definition not have different types. 



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818863#comment-13818863
 ] 

Hitesh Shah commented on YARN-1390:
---

[~vinodkv] How is the application going to be identified from an application 
history point of view? 


There seem to be two different things required: lineage, to understand how an 
application was submitted ( this could be multiple levels deep ), and a way to 
identify the application itself. For example, what is the plan for an Oozie job 
that launches a pig script that in turn runs multiple mapreduce jobs? 

I think applicationType as it stands today should not change and should remain 
hardcoded by MR. For lineage, something else should be introduced but it 
requires each and every layer to cooperate to augment the lineage data. I don't 
think there is a quick fix here. This is something which can be introduced at 
the hadoop layer but will need to traverse through the whole ecosystem for it 
to work correctly.
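
One way to picture cooperating layers is an append-only lineage that each layer 
extends instead of overwriting. ApplicationLineage, startedBy, thenLaunched and 
innermost are hypothetical names for illustration; no such class exists in 
Hadoop, and how the lineage would be passed between projects is left open.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: immutable lineage, outermost submitter first.
final class ApplicationLineage {
    private final List<String> layers;

    private ApplicationLineage(List<String> layers) {
        this.layers = Collections.unmodifiableList(layers);
    }

    // The outermost layer (e.g. an Oozie workflow) starts the lineage.
    static ApplicationLineage startedBy(String layer) {
        List<String> first = new ArrayList<String>();
        first.add(layer);
        return new ApplicationLineage(first);
    }

    // Each subsequent layer appends itself and keeps what its caller recorded.
    ApplicationLineage thenLaunched(String layer) {
        List<String> extended = new ArrayList<String>(layers);
        extended.add(layer);
        return new ApplicationLineage(extended);
    }

    List<String> layers() { return layers; }

    // The last layer is the one that actually runs on YARN.
    String innermost() { return layers.get(layers.size() - 1); }
}
```

For the Oozie example above, the chain would be 
ApplicationLineage.startedBy("oozie").thenLaunched("pig").thenLaunched("mapreduce").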



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-11 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818784#comment-13818784
 ] 

Karthik Kambatla commented on YARN-1390:


https://issues.apache.org/jira/browse/MAPREDUCE-5618?focusedCommentId=13818782&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13818782

Setting the applicationType to Oozie-, Pig-, or Hive-specific values might lead 
to incompatible changes. 

Adding multiple application-types for a single YARN application might be the 
best way forward. [~vinodkv], thoughts?



[jira] [Commented] (YARN-1390) Provide a way to capture source of an application to be queried through REST or Java Client APIs

2013-11-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13818751#comment-13818751
 ] 

Karthik Kambatla commented on YARN-1390:


[~vinodkv], thanks for the inputs.

bq. A Pig query is a pig query and jobs spawned for a Pig query should be of 
type Pig.
Correct. For our immediate use case that [~rkanter] described, this should 
definitely be enough. Let me create an MR JIRA to make that pluggable. 

However, what if the pig query is part of an Oozie workflow? We could set it to 
the Oozie workflow/action id, and Pig/MR shouldn't override it. But then, if we 
want to list the Pig jobs being run by a user, we'll end up missing this job.
