[
https://issues.apache.org/jira/browse/HIVE-27072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694420#comment-17694420
]
Taraka Rama Rao Lethavadla commented on HIVE-27072:
---------------------------------------------------
{quote}How is _SUCCESS an unwanted file ?
{quote}
I mean that _SUCCESS is not relevant when reading data from files in a
table/partition directory. If third-party systems create files in the
table's data directory, they should prefix them with . or _ so that they are
classified as non-data files and treated as hidden; that is the standard
convention. If such files outnumber the actual data files, they may slow
down any operation that has to traverse the files in the Hive table
directory.
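To illustrate the prefix convention described above, here is a minimal sketch in Python. It is not Hive's code; the function names is_hidden and data_files are hypothetical and only show how names starting with . or _ would be set aside as non-data files.

```python
def is_hidden(name: str) -> bool:
    """Return True if the name follows the '.' or '_' prefix convention."""
    return name.startswith((".", "_"))


def data_files(names):
    """Keep only the names that would be read as data files."""
    return [n for n in names if not is_hidden(n)]


if __name__ == "__main__":
    # Example listing of a partition directory (file names are made up).
    listing = ["_SUCCESS", ".tmp_marker", "000000_0", "000001_0"]
    print(data_files(listing))
```

Under this convention, a marker file such as _SUCCESS is skipped when reading data, but it still has to be listed and filtered, which is where the traversal cost comes from.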
My use case is that a query on a Hive table fails because an unsupported
folder exists inside the table directory, for example:
{noformat}
/user/hive/table1/part=1
/user/hive/table1/part=11
/user/hive/table1/part
/user/hive/table1/part=12
{noformat}
/user/hive/table1/part is not a valid partition directory, so queries will
fail. Whenever someone needs to understand why queries are failing, they can
simply issue a command to get a snapshot of the unrelated files in storage
and then clean them up.
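As a rough sketch of the check I have in mind (not an existing Hive command), the following Python snippet flags directories that do not match the key=value partition naming. The function name find_invalid_partition_dirs is hypothetical, it assumes a single partition column, and it works on a list of path strings rather than on the real file system.

```python
from posixpath import basename


def find_invalid_partition_dirs(paths, partition_col):
    """Return paths whose last component is not '<partition_col>=<value>'.

    Names starting with '.' or '_' are treated as hidden and not reported.
    """
    invalid = []
    for path in paths:
        name = basename(path.rstrip("/"))
        key, sep, value = name.partition("=")
        hidden = name.startswith((".", "_"))
        valid = key == partition_col and sep == "=" and value != ""
        if not hidden and not valid:
            invalid.append(path)
    return invalid


if __name__ == "__main__":
    paths = [
        "/user/hive/table1/part=1",
        "/user/hive/table1/part=11",
        "/user/hive/table1/part",
        "/user/hive/table1/part=12",
    ]
    print(find_invalid_partition_dirs(paths, "part"))
```

With the listing from my example, only /user/hive/table1/part is reported, which is the folder the user would need to clean up.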
> create an sql query to validate a given table for partitions and list out any
> discrepancies in files/folders, list out empty files etc
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-27072
> URL: https://issues.apache.org/jira/browse/HIVE-27072
> Project: Hive
> Issue Type: New Feature
> Components: HiveServer2
> Reporter: Taraka Rama Rao Lethavadla
> Priority: Major
>
> There are a couple of issues when partitions are corrupted or contain
> additional unwanted files that interfere with query execution and cause it
> to fail.
> If we run a query like "validate table table_name [partition(partition=a,..)]",
> the output should list:
> * any unwanted files, such as empty or metadata files (like _SUCCESS etc.)
> * any unwanted folders not conforming to the partition naming convention,
> like test_folder where the actual partition name looks like test=23
> * too many staging directories; if we find many, then cleanup is not
> happening properly after query execution
> * any file permission related issues, e.g. the table has one owner and a
> partition has another owner (optional)
> We have something like this in Impala: [Invalidate metadata and Refresh
> commands|https://impala.apache.org/docs/build/html/topics/impala_invalidate_metadata.html].
> We could add similar functionality to Hive.
>