[ 
https://issues.apache.org/jira/browse/HDFS-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273992#comment-14273992
 ] 

Colin Patrick McCabe commented on HDFS-7602:
--------------------------------------------

To be fair, this question came up earlier, and there is a way of doing this 
without downloading the whole file.  You can use {{hadoop fs -cat <path> | head 
-c 4096 | (your file type program of choice)}} to figure out the file type.
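
As a rough sketch of what the last stage of that pipeline could look like, the Python script below reads at most 4096 bytes from standard input and compares them against a few leading byte sequences. The SequenceFile and Parquet prefixes are the ones quoted in the issue description. The gzip and bzip2 entries, the names {{MAGIC}} and {{guess_type}}, and the {{-}} argument convention are assumptions added for illustration; none of this is part of any existing Hadoop tool.

```python
import sys

# Leading bytes -> label. "SEQ" and "PAR1" come from the issue description;
# the gzip and bzip2 signatures are illustrative additions.
MAGIC = [
    (b"SEQ", "SequenceFile"),
    (b"PAR1", "Parquet"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
]


def guess_type(head: bytes) -> str:
    """Return the label of the first matching prefix, or "unknown"."""
    for prefix, label in MAGIC:
        if head.startswith(prefix):
            return label
    return "unknown"


if __name__ == "__main__" and sys.argv[1:] == ["-"]:
    # Read only what the pipeline passes through (head -c 4096).
    print(guess_type(sys.stdin.buffer.read(4096)))
```

Saved under a hypothetical name such as {{guess_type.py}}, it would be used as {{hadoop fs -cat <path> | head -c 4096 | python guess_type.py -}}. Anything without a recognizable prefix simply comes back as "unknown".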

Should we have a file type guessing program?  Yeah, I guess.  The danger is 
that people will get upset and consider it a bug when it guesses wrong.  And 
this is just going to happen much of the time.  There are a bunch of formats 
that don't even have magic numbers (I think Avro is one), so if data is in that 
format, we're just going to have to print "unknown," I think (or maybe an Avro 
expert can comment here about a way of identifying it?)  I think Parquet and 
SequenceFile we can get, but how about SequenceFiles within SequenceFiles?

Basically, this should be a hadoop-common JIRA, and I don't think it should be 
in the "hadoop fs" tool.  It can be a jar that you run with the {{hadoop jar}} 
command.  That will also make it easier for people to write their own 
application-specific file type filters that use this as a fallback.
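
The fallback idea might look roughly like the sketch below, written in Python for brevity rather than as the Java jar proposed here. Every name in it ({{generic_guess}}, {{my_app_guess}}, {{identify}}, and the {{MYAPP}} prefix) is hypothetical; {{generic_guess}} is only a stand-in for the shared guesser and knows just the two prefixes mentioned in the issue description.

```python
from typing import Callable, Optional, Sequence

# A detector inspects the leading bytes and returns a label, or None to pass.
Detector = Callable[[bytes], Optional[str]]


def generic_guess(head: bytes) -> Optional[str]:
    """Stand-in for the shared guesser (hypothetical)."""
    if head.startswith(b"SEQ"):
        return "SequenceFile"
    if head.startswith(b"PAR1"):
        return "Parquet"
    return None


def my_app_guess(head: bytes) -> Optional[str]:
    """Hypothetical application-specific check, consulted first."""
    if head.startswith(b"MYAPP"):
        return "my-app log"
    return None


def identify(head: bytes, detectors: Sequence[Detector]) -> str:
    """Try each detector in order; report "unknown" if none matches."""
    for detect in detectors:
        label = detect(head)
        if label is not None:
            return label
    return "unknown"
```

Calling {{identify(head, [my_app_guess, generic_guess])}} gives the application's own check the first look and falls back to the generic guesser only when it declines.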

> HDFS file utility
> -----------------
>
>                 Key: HDFS-7602
>                 URL: https://issues.apache.org/jira/browse/HDFS-7602
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, tools
>    Affects Versions: 2.5.0
>            Reporter: James Kinley
>            Priority: Minor
>
> Provide a utility to determine HDFS file formats and compression types, akin 
> to Linux's file utility.
> There is no easy way to do this today, short of downloading a file and 
> running Linux's file utility on it for at least some intelligence. Although, 
> Linux's magic file does not contain any information to identify the leading 
> bytes of Hadoop's common file formats, for example: 'S', 'E', 'Q' for 
> SequenceFiles, or 'P', 'A', 'R', '1' for Parquet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
