[jira] [Comment Edited] (IMPALA-6994) Avoid reloading a table's HMS data for file-only operations

Pranay Singh (JIRA) Mon, 16 Jul 2018 15:47:56 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545825#comment-16545825
 ]


Pranay Singh edited comment on IMPALA-6994 at 7/16/18 10:46 PM:
----------------------------------------------------------------

Here is a description of the problem that this JIRA aims to fix.

Components in the Hadoop ecosystem are loosely coupled, so it is difficult to
guarantee atomicity for operations that span multiple components (there are no
distributed transactions across components). An INSERT in Impala may modify
both HDFS and the Hive Metastore. Unfortunately, if one of the steps in an
INSERT fails, it may require human intervention to clean up.

Impala's flow of operations during an INSERT:
---------------------------------------------
a) Process the SELECT portion and write the results, in parallel, into
temporary hidden HDFS files. Once the SELECT portion has completed and all
temporary files have been written, the coordinator moves the temporary files
into their permanent location in HDFS (the resulting files are no longer
hidden).

b) The coordinator contacts the catalogd to update Impala's metadata cache
with the new files and/or partitions.
     On the catalogd:
     1) The file and block metadata of existing partitions that were modified
        is refreshed.
     2) New partitions are created in the Hive Metastore, if necessary.

The table metadata (schema, location, etc.) is refreshed from the Hive
Metastore. This step is not needed when the INSERT writes only to existing
partitions, so we can avoid one class of error scenarios by not calling the
Hive Metastore in the cases where it is not needed.
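
The decision described above can be sketched roughly as follows. This is only
an illustration in Java, not Impala's code: the class InsertMetadataPlan, its
fields, and the partition-name strings are hypothetical, and it assumes that
partitions can be identified by a simple name.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names come from Impala's code base. */
final class InsertMetadataPlan {
    /** Existing partitions that received new files: refresh file/block metadata only. */
    final Set<String> refreshFilesOnly = new TreeSet<>();
    /** Partitions that do not exist yet and must be created in the Hive Metastore. */
    final Set<String> createInHms = new TreeSet<>();

    /** A Hive Metastore call is needed only when the INSERT created new partitions. */
    boolean needsHmsCall() {
        return !createInHms.isEmpty();
    }

    /** Splits the partitions written by an INSERT into the two groups above. */
    static InsertMetadataPlan plan(Set<String> existingPartitions,
                                   Set<String> writtenPartitions) {
        InsertMetadataPlan p = new InsertMetadataPlan();
        for (String part : writtenPartitions) {
            if (existingPartitions.contains(part)) {
                p.refreshFilesOnly.add(part);
            } else {
                p.createInHms.add(part);
            }
        }
        return p;
    }
}
```

With a plan like this, the catalogd would contact the Hive Metastore only when
needsHmsCall() returns true, and an INSERT into existing partitions would touch
HDFS metadata alone.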


> Avoid reloading a table's HMS data for file-only operations
> -----------------------------------------------------------
>
>                 Key: IMPALA-6994
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6994
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 2.12.0
>            Reporter: Balazs Jeszenszky
>            Assignee: Pranay Singh
>            Priority: Major
>
> Reloading file metadata for HDFS tables (e.g. as a final step in an INSERT)
> is done via
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L628
> which calls
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1243
>
> HdfsTable.load has no option to load only file metadata. HMS metadata is
> also reloaded every time, which is unnecessary overhead (and a potential
> point of failure) when adding files to existing locations.
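
One way to picture the missing option is a load method with a flag that
controls whether HMS is contacted. The sketch below is hypothetical: the class
CachedTableSketch, the reloadHmsMetadata parameter, and the two Supplier
stand-ins are assumptions made for illustration and are not the real
HdfsTable.load signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch; not the real HdfsTable class or its load() signature. */
final class CachedTableSketch {
    private final Supplier<String> hmsSchemaFetcher;  // stands in for an HMS round trip
    private final Supplier<List<String>> fileLister;  // stands in for an HDFS listing
    private String schema;                            // null until first loaded from HMS
    private List<String> files = new ArrayList<>();

    CachedTableSketch(Supplier<String> hmsSchemaFetcher,
                      Supplier<List<String>> fileLister) {
        this.hmsSchemaFetcher = hmsSchemaFetcher;
        this.fileLister = fileLister;
    }

    /**
     * Always refreshes file metadata. Contacts HMS only when asked to,
     * or when no schema has been cached yet.
     */
    void load(boolean reloadHmsMetadata) {
        if (reloadHmsMetadata || schema == null) {
            schema = hmsSchemaFetcher.get();
        }
        files = new ArrayList<>(fileLister.get());
    }

    String schema() { return schema; }

    List<String> files() { return files; }
}
```

A caller that has only added files to an existing location would pass false,
so a temporary HMS failure could no longer break that code path.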


