[ 
https://issues.apache.org/jira/browse/AVRO-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968513#comment-16968513
 ] 

Ryan Skraba edited comment on AVRO-2616 at 11/6/19 5:13 PM:
------------------------------------------------------------

Quick investigation: the Hadoop classes are only used for opening InputStreams 
(or SeekableInputs), which are then passed to the core classes.
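
To illustrate, a minimal sketch of the two code paths ({{SeekableFileInput}} and 
{{FsInput}} are the existing Avro classes; the {{openSeekable}} helper itself is 
hypothetical):

{code:java}
import java.io.File;
import java.io.IOException;
import java.net.URI;

import org.apache.avro.file.SeekableFileInput;
import org.apache.avro.file.SeekableInput;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class SeekableInputs {
  /** Sketch only: open a seekable input, using plain java.io for local files. */
  static SeekableInput openSeekable(String uri) throws IOException {
    URI parsed = URI.create(uri);
    if (parsed.getScheme() == null || "file".equals(parsed.getScheme())) {
      // Local file: no Hadoop classes needed at all.
      return new SeekableFileInput(new File(parsed.getPath()));
    }
    // hdfs://, webhdfs://, ftp://, ... go through the Hadoop FileSystem.
    return new FsInput(new Path(uri), new Configuration());
  }
}
{code}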

In most cases, it would be pretty easy to abandon the Hadoop FileSystem 
entirely _if_ no one uses commands like {{avro-tools tojson 
hdfs://mynamenode.example.com:8020/user/rskraba/input.avro}}.  I've been using 
avro-tools for years... I didn't even realize that it _could_ take Hadoop file 
URIs!

(The exception to the above is the *tether* task, which launches a MapReduce job 
and requires the Hadoop classes.)

I'm tempted to think:
# Remove the Hadoop classes and their transitive dependencies entirely from the 
uber jar.
# If the Hadoop classes aren't present and a local file is specified, use Java 
input streams and Avro core classes.
# If the Hadoop classes aren't present and an HDFS URI is specified, fail with 
a link to documentation on how to build an avro-tools jar customized for your 
cluster (rough sketch below).
# If the Hadoop classes _are_ present (because you built your own uber-jar for 
your cluster), keep the current behaviour.

(Attempting to use the tether task should fail with the link to the 
documentation too.)
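
Purely as a sketch of how that dispatch could look (the class name checked is 
Hadoop's real {{org.apache.hadoop.fs.FileSystem}}; the method and error message 
are made up):

{code:java}
import java.io.IOException;
import java.net.URI;

public class HadoopDispatch {
  // Detect the Hadoop classes once, without linking against any hadoop-* type.
  private static final boolean HADOOP_PRESENT = isHadoopPresent();

  private static boolean isHadoopPresent() {
    try {
      Class.forName("org.apache.hadoop.fs.FileSystem");
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    }
  }

  /** Fail fast with a pointer to the docs instead of a NoClassDefFoundError. */
  static void checkUriSupported(String uri) throws IOException {
    String scheme = URI.create(uri).getScheme();
    boolean local = scheme == null || "file".equals(scheme);
    if (!local && !HADOOP_PRESENT) {
      throw new IOException("Scheme '" + scheme + "' needs the Hadoop classes on the classpath; "
          + "see the documentation on building an avro-tools jar for your cluster.");
    }
  }
}
{code}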

For info, we currently include support for local, HDFS, FTP and other ways of 
accessing HDFS (Har, Hftp, WebHDFS).




> Do not use Hadoop FS for local files with avro-tools
> ----------------------------------------------------
>
>                 Key: AVRO-2616
>                 URL: https://issues.apache.org/jira/browse/AVRO-2616
>             Project: Apache Avro
>          Issue Type: Bug
>          Components: java
>            Reporter: Ryan Skraba
>            Priority: Minor
>
> The avro-tools jar includes the Hadoop dependencies inside the fat jar.  This 
> is probably so that the CLI can operate with files that are located on the 
> cluster (or other URIs that have Hadoop FileSystem implementations).
> This is useful if the user is accessing an HDFS cluster that matches the 
> bundled Hadoop version, and *mostly* neutral if the user is accessing other 
> URIs, including the local filesystem.
> Hadoop doesn't currently officially support any Java version [after JDK 
> 8|https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions] 
> (at the time of writing).  We might want to access the local filesystem while 
> bypassing the Hadoop jars, to avoid requiring this dependency when it is not needed.



