[ https://issues.apache.org/jira/browse/IMPALA-11014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446526#comment-17446526 ]

Csaba Ringhofer commented on IMPALA-11014:
------------------------------------------

Inserts into HDFS tables are not atomic in Impala by default - the only way to 
make inserts truly atomic is to use Hive ACID or Iceberg tables, but support 
for those was added in newer versions of Impala.

We try to make them "as atomic as possible" by writing to a staging directory 
and then moving the files to their final place with atomic moves - but if 
several files are created and an error occurs after only a subset of them have 
been moved, we get a partial write. Another possible issue is that with dynamic 
partitioning the creation of new partitions can fail, so the moved files show 
up in existing partitions but not in the new ones.

There are some cases where we don't even use a staging directory, for example 
on S3 when the query option s3_skip_insert_staging is true (the goal is to 
skip the move operation, as it is expensive on S3).

"MetaException: Object with id "" is managed by a different persistence manager 
"
This error is not familiar to me, but I expect it to come from HMS. What 
version of HMS is used?

>Can you suggest a workaround for this? Is it safe to assume that the data is 
>always inserted when this particular error happens?
I am not sure - sending the relevant parts of the Impala log could be helpful 
to see where this error comes from. If it comes after the file moves and 
partition creation, then the write can be considered "complete".

> Can we rely on the rows_inserted and rows_produced fields of the query in 
> order to make assumptions about what data is inserted?
No, these are populated when we write the files, and do not tell anything 
about whether moving the files was successful.

>Can you suggest a workaround for this?
It is possible to check whether the moves for an INSERT finished by looking at 
the staging directory in the filesystem - the file names are prefixed by the 
query ID (e.g. 194f9d029d30bb07-fb64dc3300000000_945554289_data.0.txt is 
created by query 194f9d029d30bb07:fb64dc3300000000) - if you see such files in 
the staging directory, then not all moves were finished. It is also possible to 
clean up the files created by a given INSERT this way, as sketched below.
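
A rough sketch of that check (assuming the staging directory lives under the 
table directory - the exact path depends on your setup - and using the hdfs 
CLI; the function name and path are hypothetical):
{code:python}
import subprocess

def leftover_staged_files(staging_dir, query_id):
    """Return staged files that belong to the given query. Per the example
    above, query 194f9d029d30bb07:fb64dc3300000000 writes files named
    194f9d029d30bb07-fb64dc3300000000_..., i.e. the filename prefix is the
    query ID with ':' replaced by '-'."""
    prefix = query_id.replace(":", "-")
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", staging_dir],
        capture_output=True, text=True, check=True,
    ).stdout
    # The path is the last column of each 'hdfs dfs -ls' output line.
    paths = [
        line.split()[-1]
        for line in out.splitlines()
        if line.strip() and not line.startswith("Found")
    ]
    return [p for p in paths if p.rsplit("/", 1)[-1].startswith(prefix)]

# If this returns a non-empty list, not all moves finished; deleting these
# files also cleans up after the failed INSERT.
# leftover_staged_files("/path/to/table/_impala_insert_staging",
#                       "194f9d029d30bb07:fb64dc3300000000")
{code}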

It would be great to have counters in the profile for the intended number of 
moves and the number actually finished, but I don't know of anything like this.

> Data is being inserted even though an INSERT INTO query fails
> -------------------------------------------------------------
>
>                 Key: IMPALA-11014
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11014
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Tsvetomir Palashki
>            Priority: Major
>
> We are executing an INSERT INTO query against Impala. In rare cases this 
> query fails with the following error:
> {code:java}
> MetaException: Object with id "" is managed by a different persistence 
> manager {code}
> Even though there is an error, the data is inserted into the table. This is 
> particularly problematic due to our error handling logic, which refreshes the 
> table metadata and retries the query, which causes data duplication.
> I am aware that this bug might be fixed in one of the newer Impala versions, 
> but at this point, we are unable to upgrade.
> Can you suggest a workaround for this? Is it safe to assume that the data is 
> always inserted when this particular error happens? Can we rely on the 
> rows_inserted and rows_produced fields of the query in order to make 
> assumptions about what data is inserted?
> The exact version of our Impala is:
> {code:java}
> impalad version 3.2.0-cdh6.3.2 RELEASE (build 
> 1bb9836227301b839a32c6bc230e35439d5984ac) Built on Fri Nov 8 07:22:06 PST 
> 2019 {code}
>  


