[
https://issues.apache.org/jira/browse/IMPALA-11014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446526#comment-17446526
]
Csaba Ringhofer commented on IMPALA-11014:
------------------------------------------
Inserts to HDFS tables are not atomic in Impala by default - the only way to
make inserts really atomic is to use Hive ACID or Iceberg tables, but these
were added in newer versions of Impala.
We try to make them "as atomic as possible" by writing to a staging directory
and moving the files to their final place with atomic renames - but if several
files are created and an error occurs after only a subset of them has been
moved, we get a partial write. Another possible issue: with dynamic
partitioning, the creation of new partitions can fail, so the moved files are
visible in existing partitions but not in the new ones.
There are some cases where we don't even use staging directories, for example
on S3 when the query option s3_skip_insert_staging is true (the goal is to skip
the move operation, as it is expensive on S3).
"MetaException: Object with id "" is managed by a different persistence manager
"
This error is not familiar to me, but I expect it to come from HMS. What
version of HMS is used?
>Can you suggest a workaround for this? Is it safe to assume that the data is
>always inserted when this particular error happens?
I am not sure - sending some parts of the Impala log could be helpful to see
where this error comes from. If it occurs after the file moves and partition
creation, then the write can be considered "complete".
> Can we rely on the rows_inserted and rows_produced fields of the query in
> order to make assumptions about what data is inserted?
No, these are populated when we write the files, and they do not tell anything
about whether moving the files was successful.
>Can you suggest a workaround for this?
It is possible to check whether the moves for an INSERT were finished by
inspecting the staging directory in the filesystem - the file names are
prefixed with the query ID (e.g. 194f9d029d30bb07-fb64dc3300000000_945554289_data.0.txt is
created by query 194f9d029d30bb07:fb64dc3300000000) - if you see such files in
the staging directory, then not all moves were finished. It is also possible to
clean up files created by a given INSERT this way.
It would be great to have some counters in the profile for the intended number
of moves and the actually finished ones, but I don't know of anything like this.
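The check above can be sketched as a small helper. This is a minimal,
hypothetical illustration, not an official Impala tool: the colon-to-dash
mapping from query ID to file prefix follows the example given above, but the
exact staging directory path depends on your deployment (and on HDFS you would
list the directory via the HDFS client rather than the local filesystem).
{code:python}
import os

def query_id_to_file_prefix(query_id: str) -> str:
    # Query ID "194f9d029d30bb07:fb64dc3300000000" maps to the file
    # prefix "194f9d029d30bb07-fb64dc3300000000" (colon replaced by dash),
    # per the example in this comment.
    return query_id.replace(":", "-")

def has_unfinished_moves(staging_dir: str, query_id: str) -> bool:
    # If any file with this query's prefix is still in the staging
    # directory, not all moves for the INSERT were finished.
    prefix = query_id_to_file_prefix(query_id)
    try:
        return any(name.startswith(prefix) for name in os.listdir(staging_dir))
    except FileNotFoundError:
        # Staging directory already removed: nothing left to move.
        return False
{code}
The same listing can also drive cleanup: deleting the files that match the
prefix removes the leftovers of a given INSERT.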
> Data is being inserted even though an INSERT INTO query fails
> -------------------------------------------------------------
>
> Key: IMPALA-11014
> URL: https://issues.apache.org/jira/browse/IMPALA-11014
> Project: IMPALA
> Issue Type: Bug
> Reporter: Tsvetomir Palashki
> Priority: Major
>
> We are executing an INSERT INTO query against Impala. In rare cases this
> query fails with the following error:
> {code:java}
> MetaException: Object with id "" is managed by a different persistence
> manager {code}
> Even though there is an error, the data is inserted into the table. This is
> particularly problematic due to our error handling logic, which refreshes the
> table metadata and retries the query, which causes data duplication.
> I am aware that this bug might be fixed in one of the newer Impala versions,
> but at this point, we are unable to upgrade.
> Can you suggest a workaround for this? Is it safe to assume that the data is
> always inserted when this particular error happens? Can we rely on the
> rows_inserted and rows_produced fields of the query in order to make
> assumptions about what data is inserted?
> The exact version of our Impala is:
> {code:java}
> impalad version 3.2.0-cdh6.3.2 RELEASE (build
> 1bb9836227301b839a32c6bc230e35439d5984ac) Built on Fri Nov 8 07:22:06 PST
> 2019 {code}
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]