[
https://issues.apache.org/jira/browse/HDFS-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273992#comment-14273992
]
Colin Patrick McCabe commented on HDFS-7602:
--------------------------------------------
To be fair, this question came up earlier, and there is a way of doing this
without downloading the whole file. You can use {{hadoop fs -cat <path> |
head -c 4096 | (your file type program of choice)}} to figure out the file type.
Should we have a file type guessing program? Yeah, I guess. The danger is
that people will get upset and consider it a bug when it guesses wrong. And
this is just going to happen much of the time. There are a bunch of formats
that don't even have magic numbers (I think Avro is one), so if data is in that
format, we're just going to have to print "unknown," I think (or maybe an Avro
expert can comment here about a way of identifying it?) I think Parquet and
SequenceFile we can get, but how about SequenceFiles within SequenceFiles?
Basically, this should be a hadoop-common JIRA, and I don't think it should be
in the "hadoop fs" tool. It can be a jar that you run with the {{hadoop jar}}
command. That will also make it easier for people to write their own
application-specific file type filters that use this as a fallback.
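
To make the fallback idea concrete, here is a rough sketch of how such a guesser could be layered. It is only an illustration, not a proposed implementation: the names {{guess_file_type}}, {{detect_by_magic}} and {{detect_my_log}} are hypothetical, the "SEQ" and "PAR1" signatures are the ones quoted in the issue description, and the gzip and bzip2 signatures are added here as an assumption about what "compression types" might cover. Application-specific detectors run first, the generic magic-number check is the fallback, and anything unrecognized is reported as "unknown."

```python
from typing import Callable, Iterable, Optional

# A detector looks at the leading bytes of a file and returns a type name,
# or None if it does not recognize the data.
Detector = Callable[[bytes], Optional[str]]

# Leading-byte signatures. "SEQ" and "PAR1" come from the issue description;
# the gzip and bzip2 entries are assumptions added for illustration.
MAGIC_PREFIXES = (
    (b"SEQ", "SequenceFile"),
    (b"PAR1", "Parquet"),
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
)


def detect_by_magic(head: bytes) -> Optional[str]:
    """Generic fallback: match the leading bytes against known signatures."""
    for prefix, name in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return name
    return None


def guess_file_type(head: bytes, app_detectors: Iterable[Detector] = ()) -> str:
    """Try application-specific detectors first, then the generic fallback."""
    for detector in app_detectors:
        result = detector(head)
        if result is not None:
            return result
    # Formats without a recognizable signature end up as "unknown".
    return detect_by_magic(head) or "unknown"


def detect_my_log(head: bytes) -> Optional[str]:
    """Hypothetical application-specific filter for a made-up log format."""
    return "my-app-log" if head.startswith(b"#MYLOG") else None
```

With something like this, {{guess_file_type(head, [detect_my_log])}} would consult the application's own filter before the shared one, where {{head}} is the first few kilobytes of the file, obtained for example with the {{hadoop fs -cat}} and {{head -c 4096}} pipeline mentioned earlier.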
> HDFS file utility
> -----------------
>
> Key: HDFS-7602
> URL: https://issues.apache.org/jira/browse/HDFS-7602
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client, tools
> Affects Versions: 2.5.0
> Reporter: James Kinley
> Priority: Minor
>
> Provide a utility to determine HDFS file formats and compression types, akin
> to Linux's file utility.
> There is no easy way to do this today, short of downloading a file and
> running Linux's file utility on it for at least some intelligence. However,
> Linux's magic file does not contain any information to identify the leading
> bytes of Hadoop's common file formats, for example: 'S', 'E', 'Q' for
> SequenceFiles, or 'P', 'A', 'R', '1' for Parquet.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)