[
https://issues.apache.org/jira/browse/IMPALA-6994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545825#comment-16545825
]
Pranay Singh edited comment on IMPALA-6994 at 7/16/18 10:46 PM:
----------------------------------------------------------------
Here is a description of the problem that this JIRA aims to fix.
Components in the Hadoop ecosystem are loosely coupled, so it is difficult to
guarantee atomicity for operations that span multiple components (there are no
distributed transactions across components). An INSERT in Impala may modify
both HDFS and the Hive Meta Store. Unfortunately, if one of the steps in an
INSERT fails, cleaning up may require human intervention.
Impala's flow of operations during an INSERT:
--------------------------------------------------------------
a) Process the SELECT portion and write the results into temporary, hidden HDFS
files in parallel. Once the SELECT portion has completed and all temporary files
have been written, the coordinator moves the temporary files into their
permanent location in HDFS (the resulting files are no longer hidden).
b) The coordinator contacts the catalogd to update Impala's metadata cache with
the new files and/or partitions.
On the catalogd:
1) The file and block metadata of existing partitions that were modified
is refreshed.
2) New partitions are created in the Hive Meta Store, if necessary.
The table metadata (schema, location, etc.) is refreshed from the Hive Meta
Store. This step is not needed if the INSERT writes to an existing partition.
We can therefore reduce the number of these error scenarios by not calling the
Hive Meta Store when it is not needed.
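To make the idea concrete, here is a minimal sketch of the decision the catalogd could make once the coordinator has moved the files into place. It is only an illustration under stated assumptions: the class InsertFinalizeSketch, the Plan type, and the plan() method are hypothetical names and are not part of Impala, and it assumes the Hive Meta Store needs to be contacted only when the INSERT touches a partition that does not exist yet.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch only; none of these names are Impala's real classes. */
public class InsertFinalizeSketch {

    /** Catalog-side work implied by an INSERT whose files are already in place. */
    public static final class Plan {
        /** Existing partitions: only file and block metadata is refreshed. */
        public final Set<String> partitionsToRefresh;
        /** Partitions that do not exist yet: these must be created in HMS. */
        public final Set<String> partitionsToCreate;
        /** Assumption: HMS is contacted only when a partition has to be created. */
        public final boolean needsHmsCall;

        Plan(Set<String> refresh, Set<String> create) {
            this.partitionsToRefresh = refresh;
            this.partitionsToCreate = create;
            this.needsHmsCall = !create.isEmpty();
        }
    }

    /** Splits the partitions an INSERT wrote to into "refresh files" and "create in HMS". */
    public static Plan plan(Set<String> existingPartitions, Set<String> touchedPartitions) {
        Set<String> refresh = new LinkedHashSet<>();
        Set<String> create = new LinkedHashSet<>();
        for (String partition : touchedPartitions) {
            if (existingPartitions.contains(partition)) {
                refresh.add(partition);
            } else {
                create.add(partition);
            }
        }
        return new Plan(refresh, create);
    }

    /** Small helper to build an ordered set from literals. */
    public static Set<String> setOf(String... names) {
        return new LinkedHashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Set<String> existing = setOf("year=2017", "year=2018");

        // INSERT into an existing partition: a file-only operation.
        Plan fileOnly = plan(existing, setOf("year=2018"));
        System.out.println("existing partition, HMS call needed: " + fileOnly.needsHmsCall);

        // INSERT that also writes a new partition: HMS must be updated.
        Plan withNew = plan(existing, setOf("year=2018", "year=2019"));
        System.out.println("new partition, HMS call needed: " + withNew.needsHmsCall
                + ", create: " + withNew.partitionsToCreate);
    }
}
```

In this sketch an INSERT that only adds files to existing partitions never reaches the Hive Meta Store, so a Hive Meta Store failure can no longer leave that kind of INSERT half finished.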
> Avoid reloading a table's HMS data for file-only operations
> -----------------------------------------------------------
>
> Key: IMPALA-6994
> URL: https://issues.apache.org/jira/browse/IMPALA-6994
> Project: IMPALA
> Issue Type: Improvement
> Components: Catalog
> Affects Versions: Impala 2.12.0
> Reporter: Balazs Jeszenszky
> Assignee: Pranay Singh
> Priority: Major
>
> Reloading file metadata for HDFS tables (e.g. as a final step in an 'insert')
> is done via
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L628
> , which calls
> https://github.com/apache/impala/blob/branch-2.12.0/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java#L1243
> HdfsTable.load has no option to load only file metadata. HMS metadata is
> also reloaded every time, which is unnecessary overhead (and a potential
> point of failure) when adding files to existing locations.