[jira] [Commented] (HADOOP-14576) DynamoDB tables may leave ACTIVE state after initial connection

2017-09-01 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16150616#comment-16150616
 ] 

Steve Loughran commented on HADOOP-14576:
-

DDB can move to UPDATING after you change the capacity (as HADOOP-14220 will let you 
do). The retry handler needs to recognise the special case of 
update-in-progress and use a long retry for it.
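
One way to picture the special case: the handler picks a retry budget from the table status, with a much longer budget for UPDATING. This is only a sketch. The class names, method name and limits below are hypothetical and are not taken from the S3Guard code.

```java
import java.time.Duration;

/** Hypothetical retry budget: number of attempts and the wait between them. */
final class RetryBudget {
    final int maxAttempts;
    final Duration interval;

    RetryBudget(int maxAttempts, Duration interval) {
        this.maxAttempts = maxAttempts;
        this.interval = interval;
    }
}

/** Sketch: choose a retry budget from the DynamoDB table status string. */
final class TableStatusRetry {
    // Illustrative values only; real limits would come from configuration.
    static final RetryBudget SHORT = new RetryBudget(3, Duration.ofMillis(500));
    static final RetryBudget LONG = new RetryBudget(60, Duration.ofSeconds(5));
    static final RetryBudget NONE = new RetryBudget(1, Duration.ZERO);

    static RetryBudget budgetFor(String tableStatus) {
        switch (tableStatus) {
            case "ACTIVE":
                return SHORT; // ordinary transient failure
            case "UPDATING":
                return LONG;  // update in progress: wait it out
            default:
                return NONE;  // e.g. DELETING: retrying will not help
        }
    }
}
```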

> DynamoDB tables may leave ACTIVE state after initial connection
> ---
>
> Key: HADOOP-14576
> URL: https://issues.apache.org/jira/browse/HADOOP-14576
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.0.0-beta1
> Reporter: Sean Mackrory
>
> We currently only anticipate tables not being in the ACTIVE state when first 
> connecting. It is possible for a table to be in the ACTIVE state and move to 
> an UPDATING state during partitioning events. Attempts to read or write 
> during that time will result in an AmazonServiceException getting thrown. We 
> should try to handle that better...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14576) DynamoDB tables may leave ACTIVE state after initial connection

2017-07-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076792#comment-16076792
 ] 

Steve Loughran commented on HADOOP-14576:
-

Is this parallel rename how Hive implements a commit? If so, and we can move 
Hive off rename (copy & delete) as its commit strategy, that could make the 
problem go away.

Assuming it is a commit operation, we would presumably like it to complete, 
even if it took more than one attempt to go through. Which means: it is 
something which should be retried.

I'm adding a retry policy in the HADOOP-13786 commit code (and more fault 
injection into the inconsistent client); we can use its handling and tests as a 
basis for this. That branch/patch handles 503 throttled responses from S3 with 
backoff and retry (and a shorter policy for other failures considered 
recoverable). DDB state changes could be treated as another error to handle 
under the throttle policy.
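
To make "backoff" concrete, here is a minimal sketch of a capped exponential delay of the kind a throttle policy might use. It is not the HADOOP-13786 code. The class name, method name and parameters are assumptions for illustration.

```java
/** Sketch: capped exponential backoff (hypothetical helper). */
final class Backoff {
    /**
     * Delay before retry number {@code attempt} (0-based):
     * baseMillis doubled once per earlier attempt, never above capMillis.
     */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        if (attempt < 0 || baseMillis <= 0 || capMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            // Jump straight to the cap rather than doubling past it,
            // which also rules out long overflow.
            delay = (delay > capMillis / 2) ? capMillis : delay * 2;
        }
        return delay;
    }
}
```

With a 100 ms base and a 5 s cap, the delays grow 100, 200, 400, ... and then stay at the cap, so a table that stays in UPDATING for a while is polled at a steady, bounded rate.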









[jira] [Commented] (HADOOP-14576) DynamoDB tables may leave ACTIVE state after initial connection

2017-07-05 Thread Aaron Fabbri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16075123#comment-16075123
 ] 

Aaron Fabbri commented on HADOOP-14576:
---

Thanks for filing the JIRA, [~mackrorysd].

{quote}
My concern with failing over to S3 for non-auth read is what happens when 
you're listing stuff that isn't consistent on S3 yet. IMO non-auth mode is 
really just to enable lazily loading data that already existed or that is added 
outside of S3Guard. I don't think it should weaken guarantees in the presence 
of partitioning events in DynamoDB.
{quote}

Yeah I think both arguments have merit.

We could argue that falling back to basic S3 without consistency is better than 
failing a job. We could also have a configuration flag that lets users choose 
either behavior. Since the chance of inconsistency is pretty low, there is 
a good probability that running in degraded mode (no S3Guard) until the table 
comes back would be successful.
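
The configuration flag idea could look roughly like this. It is a sketch under assumptions: the enum, the `Lister` interface and `DegradedListing` are hypothetical names, and real listings would be richer than a `String[]`.

```java
import java.io.IOException;

/** Hypothetical switch for what to do when the metadata store fails. */
enum StoreFailureMode { FAIL, FALL_BACK_TO_S3 }

/** Hypothetical listing source that may fail with an IOException. */
interface Lister {
    String[] list(String path) throws IOException;
}

final class DegradedListing {
    /** List via the metadata store; on failure, obey the configured mode. */
    static String[] list(String path, Lister store, Lister s3, StoreFailureMode mode)
            throws IOException {
        try {
            return store.list(path);
        } catch (IOException e) {
            if (mode == StoreFailureMode.FAIL) {
                throw e; // keep the consistency guarantee and fail the job
            }
            // Degraded mode: results may be inconsistent until the table is back.
            return s3.list(path);
        }
    }
}
```

Keeping `FAIL` as an explicit choice preserves the stricter behavior for users who would rather lose a job than risk an inconsistent listing.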

I'm not sure authoritative mode matters? The client being in S3Guard 
authoritative mode just means that the FS client *may* skip round trips to S3 
*if* the MetadataStore reports it has a full listing. Since the MetadataStore 
throws an error, the "if MetadataStore reports full listing" condition is not 
met.
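
The condition described above can be written down as a small predicate. This is a sketch of the reasoning only, not the S3AFileSystem logic: `StoreListing` and `ListingDecision` are hypothetical, and a store error is modelled here as an empty `Optional`.

```java
import java.util.Optional;

/** Hypothetical result of asking the metadata store for a directory listing. */
final class StoreListing {
    final boolean fullListing; // store claims it knows every child

    StoreListing(boolean fullListing) {
        this.fullListing = fullListing;
    }
}

final class ListingDecision {
    /**
     * @param authoritativeMode whether the client is in authoritative mode
     * @param fromStore the store's listing, or empty if the store threw an error
     * @return true if S3 still has to be consulted
     */
    static boolean mustQueryS3(boolean authoritativeMode, Optional<StoreListing> fromStore) {
        // The round trip may be skipped only when all three conditions hold.
        boolean canSkip = authoritativeMode
                && fromStore.isPresent()
                && fromStore.get().fullListing;
        return !canSkip;
    }
}
```

Under this model a store failure gives the same answer in both modes: S3 is queried.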







[jira] [Commented] (HADOOP-14576) DynamoDB tables may leave ACTIVE state after initial connection

2017-06-23 Thread Sean Mackrory (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16061659#comment-16061659
 ] 

Sean Mackrory commented on HADOOP-14576:


{quote}Do you think this is needed for an initial preview?{quote}

No. The fix would be DynamoDB-specific, so it's unlikely to be disruptive to 
other S3 development.

{quote}How long does it stay in this state? Can we spin until it is 
fixed?{quote}

The workload that hit this has run numerous times before and never had an 
issue. So it's possible that's because the window in which it happens is small 
enough to have been missed until now. I've been trying to pump data into a 
table and haven't been able to reproduce it. The workload in question was 
Hive's parallel rename (which gets aborted on the first failure), though, and 
not something like a MapReduce job, so there's no indication of how parallel or 
subsequent tasks would have been affected. It's also possible that other 
unplanned glitches server-side might look the same but take much longer. I'd 
suggest we consider reusing retryBackoff() from the getVersionMarker logic in 
the event of failures?
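
For illustration, a generic retry loop of the kind being suggested is sketched below. It is not the existing retryBackoff() from the getVersionMarker logic, whose signature is not shown in this thread. `RetryLoop.withRetries` and its parameters are hypothetical.

```java
import java.util.concurrent.Callable;

/** Sketch: retry an operation a bounded number of times with a fixed sleep. */
final class RetryLoop {
    static <T> T withRetries(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (e instanceof InterruptedException) {
                    // Do not swallow interrupts: restore the flag and give up.
                    Thread.currentThread().interrupt();
                    throw e;
                }
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last; // every attempt failed; surface the final failure
    }
}
```

A real version would retry only on failures known to be transient and would surface the final exception to the caller, so that a workload such as Hive's parallel rename still aborts if the table never recovers.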

My concern with failing over to S3 for non-auth read is what happens when 
you're listing stuff that isn't consistent on S3 yet. IMO non-auth mode is 
really just to enable lazily loading data that already existed or that is added 
outside of S3Guard. I don't think it should weaken guarantees in the presence 
of partitioning events in DynamoDB.







[jira] [Commented] (HADOOP-14576) DynamoDB tables may leave ACTIVE state after initial connection

2017-06-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16060628#comment-16060628
 ] 

Steve Loughran commented on HADOOP-14576:
-

# Do you think this is needed for an initial preview?
# How long does it stay in this state? Can we spin until it is fixed?
# For non-auth read, we could recover just by going to S3. For auth we don't 
have that luxury.

Recommend adding an instrumentation field to count how often this happens; it 
would be useful in production to identify why delays were arising.
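
The instrumentation field could be as small as a thread-safe counter. The sketch below uses a hypothetical class and method names and is not wired into the S3A instrumentation classes.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: count operations that found the table outside the ACTIVE state. */
final class TableStateStats {
    private final AtomicLong notActiveEvents = new AtomicLong();

    /** Record one such event and return the running total. */
    long recordTableNotActive() {
        return notActiveEvents.incrementAndGet();
    }

    /** Current total, e.g. for logging or a metrics report. */
    long tableNotActiveCount() {
        return notActiveEvents.get();
    }
}
```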



