[ 
https://issues.apache.org/jira/browse/HADOOP-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124475#comment-15124475
 ] 

Sangjin Lee commented on HADOOP-12747:
--------------------------------------

Thanks [~jira.shegalov] for your suggestion. You brought up some important 
points that we need to reconcile.

It is true that today one can pass a directory to \-files and also to 
\-libjars. In the case of MR, the entire directory (including all files and 
subdirectories, recursively) does get copied over and localized to the nodes. 
For libjars, however, as you observed, the classpath does not work *if you 
meant the directory as a list of jars*, because the classpath entry simply 
references the directory itself. On the other hand, if you meant it as a real 
directory root (consisting of class files), it still works correctly.

When it comes to classpaths (on which libjars is modeled), {{directory}} and 
{{directory/\*}} are different, as you're undoubtedly aware. {{directory/\*}} is 
specifically interpreted by the JVM as the list of jars in that directory. IMO 
it would be good to maintain that definition for libjars. That would lead to a 
consistent expectation.

Also, I learned of this interesting nugget while looking at 
{{GenericOptionsParser}}: the value of libjars is added to the client classpath:
{code}
      //setting libjars in client classpath
      URL[] libjars = getLibJars(conf);
      if(libjars!=null && libjars.length>0) {
        conf.setClassLoader(new URLClassLoader(libjars, conf.getClassLoader()));
        Thread.currentThread().setContextClassLoader(
            new URLClassLoader(libjars, 
                Thread.currentThread().getContextClassLoader()));
      }
{code}
Thus, if we allow the wildcard, it will need to be expanded in 
{{GenericOptionsParser}} before this point.
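
To make the intended semantics concrete, here is a rough sketch of what such an 
expansion could look like. This is not the patch and not existing Hadoop code: 
the class {{LibJarsWildcard}} and its {{expand}} method are hypothetical names 
made up for illustration. It assumes local paths with "/" as the separator and 
comma-separated entries, and it sorts the result only to get a deterministic 
order (the JVM does not guarantee one).
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): JVM-style wildcard expansion. */
final class LibJarsWildcard {

  /**
   * Expands a comma-separated libjars value. An entry ending in "/*" is
   * replaced by the jar files directly inside that directory (child
   * directories are not traversed). Any other entry is passed through as is.
   */
  static List<String> expand(String libjars) {
    List<String> result = new ArrayList<>();
    for (String raw : libjars.split(",")) {
      String entry = raw.trim();
      if (entry.isEmpty()) {
        continue;
      }
      if (!entry.endsWith("/*")) {
        result.add(entry); // plain file or directory: leave untouched
        continue;
      }
      File dir = new File(entry.substring(0, entry.length() - 2));
      File[] children = dir.listFiles();
      if (children == null) {
        continue; // not a directory, or not readable
      }
      Arrays.sort(children); // deterministic order (an assumption of this sketch)
      for (File child : children) {
        String name = child.getName();
        // Same rule as the JVM classpath wildcard: only .jar or .JAR files.
        if (child.isFile() && (name.endsWith(".jar") || name.endsWith(".JAR"))) {
          result.add(child.getPath());
        }
      }
    }
    return result;
  }
}
{code}
With something like this, {{lib/\*}} would become the individual jar paths under 
{{lib}}, while a bare {{lib}} would stay a single directory entry, so both 
meanings remain available to the user.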

On a related note, there is the matter of the shared cache (YARN-1492). For the 
shared cache to work correctly for a directory, the shared cache client needs 
to negotiate with the shared cache manager on a per-file basis (some files in 
the directory may be present in the shared cache; some may not). So a 
client-side expansion (at some point) is likely needed for the shared cache. 
We'll need to ensure the MapReduce portion of the shared cache work 
(MAPREDUCE-5951) handles directories correctly (cc [~ctrezzo]).

If needed, I can spell out a little more how {{directory}} and {{directory/\*}} 
should be interpreted and used for libjars in comments/javadoc/documentation. 
Let me know. Thanks!

> support wildcard in libjars argument
> ------------------------------------
>
>                 Key: HADOOP-12747
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12747
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: util
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: HADOOP-12747.01.patch, HADOOP-12747.02.patch
>
>
> There is a problem when a user job adds too many dependency jars in their 
> command line. The HADOOP_CLASSPATH part can be addressed, including using 
> wildcards (\*). But the same cannot be done with the -libjars argument. Today 
> it takes only fully specified file paths.
> We may want to consider supporting wildcards as a way to help users in this 
> situation. The idea is to handle it the same way the JVM does it: \* expands 
> to the list of jars in that directory. It does not traverse into any child 
> directory.
> Also, it probably would be a good idea to do it only for libjars (i.e. don't 
> do it for -files and -archives).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
