[ 
https://issues.apache.org/jira/browse/HIVE-27072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17694420#comment-17694420
 ] 

Taraka Rama Rao Lethavadla commented on HIVE-27072:
---------------------------------------------------

{quote}How is _SUCCESS an unwanted file ? 
{quote}
I mean _SUCCESS is not relevant when reading data from the files in a 
table/partition directory. If third-party systems create files in the 
data directory of a table, they should prefix them with . or _, which 
classifies them as non-data files that we treat as hidden; that is 
the standard. If such files outnumber the actual data files, they may 
cause delays whenever the files in the Hive table directory have to be 
traversed.
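
To illustrate the naming convention, here is a minimal Python sketch (not Hive's actual code). The function name is_hidden and the sample file names are made up for illustration; the only rule taken from the discussion is that names starting with . or _ are treated as hidden.

```python
def is_hidden(name: str) -> bool:
    """Return True for names a reader should skip as non-data files.

    Assumption for this sketch: the only rule is the leading "_" or "."
    convention described above.
    """
    return name.startswith(("_", "."))


# Hypothetical directory listing: two hidden entries and two data files.
entries = ["_SUCCESS", ".hive-staging_hive_2023", "000000_0", "part-00000.orc"]

# Keep only the entries that would be read as data.
data_files = [e for e in entries if not is_hidden(e)]
print(data_files)  # ['000000_0', 'part-00000.orc']
```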

My use case is that a query on a Hive table is failing because an 
unsupported folder exists inside the table directory, like

 
{noformat}
/user/hive/table1/part=1
/user/hive/table1/part=11
/user/hive/table1/part
/user/hive/table1/part=12
{noformat}
/user/hive/table1/part is not a valid partition directory, so queries will 
fail. Whenever someone wants to understand why queries are failing, they can 
simply issue a command to get a snapshot of the unrelated files in storage so 
that the user can clean them up.
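
As a rough sketch of what such a check could report for the listing above, assuming a single partition column named part and a hypothetical helper called classify (this is not an existing Hive command or API):

```python
import posixpath


def classify(paths, partition_col):
    """Split directory paths into valid partition dirs and unexpected ones.

    Assumption for this sketch: one level of partitioning, where a valid
    directory is named <partition_col>=<non-empty value>.
    """
    valid, unexpected = [], []
    prefix = partition_col + "="
    for path in paths:
        name = posixpath.basename(path.rstrip("/"))
        # Entries starting with "_" or "." are hidden and not reported.
        if name.startswith(("_", ".")):
            continue
        if name.startswith(prefix) and len(name) > len(prefix):
            valid.append(path)
        else:
            unexpected.append(path)
    return valid, unexpected


paths = [
    "/user/hive/table1/part=1",
    "/user/hive/table1/part=11",
    "/user/hive/table1/part",
    "/user/hive/table1/part=12",
]

valid, unexpected = classify(paths, "part")
print(unexpected)  # ['/user/hive/table1/part']
```

A real implementation would list the directory from storage and take the partition columns from the table metadata; the hard-coded list only mirrors the example.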

> create an sql query to validate a given table for partitions and list out any 
> discrepancies in files/folders, list out empty files etc
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27072
>                 URL: https://issues.apache.org/jira/browse/HIVE-27072
>             Project: Hive
>          Issue Type: New Feature
>          Components: HiveServer2
>            Reporter: Taraka Rama Rao Lethavadla
>            Priority: Major
>
> There are a couple of issues when partitions are corrupted or have additional 
> unwanted files that interfere with query execution and cause it to fail. 
> If we run a query like "validate table table_name [partition(partition=a,..)]", 
> the output should list
>  * any unwanted files like empty/metadata files (like _SUCCESS etc.)
>  * any unwanted folders not conforming to the partition naming convention, 
> like test_folder where an actual partition name looks like test=23
>  * too many staging directories; if we find many, then cleanup is not 
> happening properly after query execution
>  * any file permission related issues, like the table has one owner and a 
> partition has another owner, etc. (Optional)
> We have something like this in Impala [Invalidate metadata and Refresh 
> commands|https://impala.apache.org/docs/build/html/topics/impala_invalidate_metadata.html]
> So we can have something similar to that functionality in Hive
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
