[jira] [Commented] (ATLAS-629) Kafka messages in ATLAS_HOOK might be lost in HA mode at the instant of failover.

2016-05-05 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15272262#comment-15272262
 ] 

Hemanth Yamijala commented on ATLAS-629:


Started looking at an approach to fix this problem. With Kafka's (old) 
high-level consumer, we only get *at-most-once* delivery because the offsets 
read from the partitions are auto-committed by default. So if a message is read 
and its offset is auto-committed, but the server reboots before the metadata 
ingest completes, that message is lost to processing.
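
The loss scenario can be illustrated with a small, self-contained simulation. This is only a sketch under simplifying assumptions: the `Partition` class, the `consume` method, and the table names are hypothetical, no real Kafka client is involved, and auto commit is modelled as committing on every read rather than on a timer.

```java
import java.util.ArrayList;
import java.util.List;

class DeliverySemanticsSketch {
    /** Hypothetical stand-in for one Kafka partition plus its committed offset. */
    static class Partition {
        final List<String> log = new ArrayList<>();
        int committedOffset = 0; // offset a restarted consumer resumes from
    }

    /**
     * Consumes from the committed offset onwards. If crashAtOffset >= 0, the
     * simulated server dies while handling that offset: after the read, but
     * before the ingest completes.
     */
    static void consume(Partition p, List<String> store,
                        boolean commitBeforeIngest, int crashAtOffset) {
        for (int offset = p.committedOffset; offset < p.log.size(); offset++) {
            String message = p.log.get(offset);
            if (commitBeforeIngest) {
                p.committedOffset = offset + 1; // models auto commit running ahead of the ingest
            }
            if (offset == crashAtOffset) {
                return; // simulated kill -9
            }
            store.add(message); // models the metadata ingest
            if (!commitBeforeIngest) {
                p.committedOffset = offset + 1; // commit only after the ingest succeeded
            }
        }
    }

    /** Runs three messages through one crash followed by a restart. */
    static List<String> runWithOneCrash(boolean commitBeforeIngest) {
        Partition p = new Partition();
        p.log.add("table_a");
        p.log.add("table_b");
        p.log.add("table_c");
        List<String> store = new ArrayList<>();
        consume(p, store, commitBeforeIngest, 1);  // crash while handling table_b
        consume(p, store, commitBeforeIngest, -1); // restart or failover, no crash
        return store;
    }

    public static void main(String[] args) {
        System.out.println("commit before ingest: " + runWithOneCrash(true));
        System.out.println("commit after ingest:  " + runWithOneCrash(false));
    }
}
```

With the commit ahead of the ingest, `table_b` never reaches the store. With the commit after the ingest, the restarted consumer resumes at `table_b` and all three messages arrive.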

To fix this issue, I am looking at *at-least-once* delivery semantics with 
Kafka, under the assumption that *message processing can be idempotent on the 
server*. Given that we use transactions in Titan and also have create-or-update 
semantics, this may be mostly true, but I am not really sure. Will need to test.
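
Why idempotence matters for at-least-once delivery can be shown with a minimal sketch. It assumes a hypothetical in-memory `EntityStore` keyed by a unique name; it is not Atlas or Titan code, and whether the real ingest path behaves this way is exactly what still needs testing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IdempotentIngestSketch {
    /** Hypothetical entity store keyed by a unique qualified name. */
    static final class EntityStore {
        private final Map<String, String> entitiesByName = new LinkedHashMap<>();

        /** Create-or-update: applying the same message twice leaves one entity. */
        void createOrUpdate(String qualifiedName, String definition) {
            entitiesByName.put(qualifiedName, definition);
        }

        int size() {
            return entitiesByName.size();
        }
    }

    /** Ingests two tables, then replays the second as a redelivered message. */
    static int ingestWithRedelivery() {
        EntityStore store = new EntityStore();
        store.createOrUpdate("default.t1", "cols=a,b");
        store.createOrUpdate("default.t2", "cols=c");
        // Redelivery: t2 was ingested, but the server died before its offset
        // was committed, so the restarted consumer reads it again.
        store.createOrUpdate("default.t2", "cols=c");
        return store.size();
    }

    public static void main(String[] args) {
        System.out.println("entities after redelivery: " + ingestWithRedelivery());
    }
}
```

Because the write is keyed, the replayed message overwrites the existing entry instead of creating a duplicate, so the store still holds two entities.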

To move to at-least-once processing, the predominant approach people follow 
seems to be to:
* Disable auto commit.
* Create one ConsumerConnector per partition of a topic.

The latter is because the old high-level consumer does not provide a 
per-partition commit. It can only commit, in one call, the offsets read from 
all the partitions it is connected to [(Reference 1)|http://grokbase.com/t/kafka/users/144b80h269/consumerconnector-commitoffsets].
The suggestion of one consumer connector per partition has been proposed 
by Kafka experts in many threads [(Reference 2)|http://mail-archives.apache.org/mod_mbox/kafka-users/201409.mbox/%3CCAHBV8WeYj8ce6G5J0k3a1hGgdNskGv3bsaP8JXSM=kwbnuj...@mail.gmail.com%3E].
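
The reason for one connector per partition can be sketched with a toy model. The `Connector` class below is hypothetical and only loosely mirrors the commit-everything behaviour of `ConsumerConnector.commitOffsets()` described above; it does not use the Kafka client library.

```java
import java.util.HashMap;
import java.util.Map;

class CommitScopeSketch {
    /** Hypothetical connector: tracks read and committed positions per partition. */
    static final class Connector {
        private final Map<Integer, Integer> readPosition = new HashMap<>();
        private final Map<Integer, Integer> committed = new HashMap<>();

        void read(int partition) {
            readPosition.merge(partition, 1, Integer::sum);
        }

        /** Commits every partition this connector has read from; there is no per-partition variant. */
        void commitOffsets() {
            committed.putAll(readPosition);
        }

        int committedFor(int partition) {
            return committed.getOrDefault(partition, 0);
        }
    }

    /** One shared connector: a commit for partition 0 also commits partition 1's unprocessed read. */
    static int sharedConnectorCommitForPartition1() {
        Connector shared = new Connector();
        shared.read(0);         // thread A read and fully processed this message
        shared.read(1);         // thread B read this message but is still processing it
        shared.commitOffsets(); // thread A commits its own progress
        return shared.committedFor(1);
    }

    /** One connector per partition: committing partition 0 leaves partition 1 untouched. */
    static int dedicatedConnectorCommitForPartition1() {
        Connector forPartition0 = new Connector();
        Connector forPartition1 = new Connector();
        forPartition0.read(0);
        forPartition1.read(1);
        forPartition0.commitOffsets();
        return forPartition1.committedFor(1);
    }

    public static void main(String[] args) {
        System.out.println("shared connector, committed for partition 1:    "
                + sharedConnectorCommitForPartition1());
        System.out.println("dedicated connector, committed for partition 1: "
                + dedicatedConnectorCommitForPartition1());
    }
}
```

In the shared case the in-flight message on partition 1 is already marked as consumed, so a crash at that point would lose it. With a dedicated connector its offset stays uncommitted until its own processing finishes.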

The other option could be to move to the newer consumer API in Kafka (0.9+), 
which (I think) provides better options for handling a per-partition commit. 
However, the new consumer is still marked beta, so I am not really sure about 
it. I can check with some Kafka committers internally.

For now, I will try out the first approach and see. In the meantime, happy to 
hear feedback from others.

> Kafka messages in ATLAS_HOOK might be lost in HA mode at the instant of 
> failover.
> -
>
> Key: ATLAS-629
> URL: https://issues.apache.org/jira/browse/ATLAS-629
> Project: Atlas
> Issue Type: Bug
> Affects Versions: 0.7-incubating
> Reporter: Hemanth Yamijala
> Assignee: Hemanth Yamijala
> Priority: Critical
> Fix For: 0.7-incubating
>
>
> Write data to Kafka continuously from Hive hook - can do this by writing a 
> script that constantly creates tables. Bring down the Active instance with 
> kill -9. Ensure writes continue after passive becomes active. The expectation 
> is the number of tables created and the number of tables in Atlas match.
> In one test, wrote 180 tables and switched over 6 times from one instance to 
> another. Found that 1 table of the lot was lost, i.e. 179 tables were 
> created in Atlas, and 1 did not get in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ATLAS-629) Kafka messages in ATLAS_HOOK might be lost in HA mode at the instant of failover.

2016-04-04 Thread Hemanth Yamijala (JIRA)

[ 
https://issues.apache.org/jira/browse/ATLAS-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15223987#comment-15223987
 ] 

Hemanth Yamijala commented on ATLAS-629:


On a related note, the number of messages written to ATLAS_ENTITIES also 
differed between a run with switchovers and one without. Both of these issues 
are related to our auto-acknowledging Kafka messages before they are committed 
to the system, without a replay capability.



