[jira] [Commented] (HIVE-15947) Enhance Templeton service job operations reliability

Hive QA (JIRA) Fri, 03 Mar 2017 20:56:30 -0800

    [ 
https://issues.apache.org/jira/browse/HIVE-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895516#comment-15895516
 ]


Hive QA commented on HIVE-15947:
--------------------------------



Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12855957/HIVE-15947.4.patch

{color:green}SUCCESS:{color} +1 due to 5 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 27 failed/errored test(s), 10300 tests 
executed
*Failed tests:*
{noformat}
TestCommandProcessorFactory - did not produce a TEST-*.xml file (likely timed 
out) (batchId=272)
TestDbTxnManager - did not produce a TEST-*.xml file (likely timed out) 
(batchId=272)
TestDummyTxnManager - did not produce a TEST-*.xml file (likely timed out) 
(batchId=272)
TestHiveInputSplitComparator - did not produce a TEST-*.xml file (likely timed 
out) (batchId=272)
TestIndexType - did not produce a TEST-*.xml file (likely timed out) 
(batchId=272)
TestSplitFilter - did not produce a TEST-*.xml file (likely timed out) 
(batchId=272)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[escape_comments] 
(batchId=229)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[schema_evol_text_vec_table]
 (batchId=147)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] 
(batchId=224)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] 
(batchId=224)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[vector_between_in] 
(batchId=119)
org.apache.hive.beeline.TestSchemaTool.testNestedScriptsForDerby (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testNestedScriptsForMySQL (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testNestedScriptsForOracle (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testPostgresFilter (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testSchemaInit (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testSchemaInitDryRun (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testSchemaUpgrade (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testSchemaUpgradeDryRun (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testScriptMultiRowComment (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testScriptWithDelimiter (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testScripts (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testValidateLocations (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testValidateNullValues (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testValidateSchemaTables (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testValidateSchemaVersions (batchId=212)
org.apache.hive.beeline.TestSchemaTool.testValidateSequences (batchId=212)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3932/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3932/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3932/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 27 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12855957 - PreCommit-HIVE-Build

> Enhance Templeton service job operations reliability
> ----------------------------------------------------
>
>                 Key: HIVE-15947
>                 URL: https://issues.apache.org/jira/browse/HIVE-15947
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Subramanyam Pattipaka
>            Assignee: Subramanyam Pattipaka
>         Attachments: HIVE-15947.2.patch, HIVE-15947.3.patch, 
> HIVE-15947.4.patch, HIVE-15947.patch
>
>
> Currently the Templeton service doesn't restrict the number of job operation 
> requests. It simply accepts and tries to run all operations. If a large number 
> of concurrent job submit requests arrive, the time to submit job 
> operations can increase significantly. Templeton uses HDFS to store the staging 
> file for a job. If HDFS storage can't respond to a large number of requests and 
> throttles, job submission can take a very long time, on the order of 
> minutes.
> This behavior may not be suitable for all applications. Client 
> applications may want a predictable, low response time for a successful 
> request, or a throttle response that tells the client to wait for some time 
> before re-requesting the job operation.
> In this JIRA, I am trying to address the following job operations:
> 1) Submit new Job
> 2) Get Job Status
> 3) List jobs
> These three operations have different complexity due to variance in their use 
> of cluster resources like YARN/HDFS.
> The idea is to introduce a new config templeton.job.submit.exec.max-procs 
> which controls the maximum number of concurrent active job submissions within 
> Templeton, and to use this config to keep response times under control. If a 
> new job submission request sees that there are already 
> templeton.job.submit.exec.max-procs jobs being submitted concurrently, then 
> the request will fail with HTTP error 503 with reason 
>    “Too many concurrent job submission requests received. Please wait for 
> some time before retrying.”
>  
> The client is expected to catch this response and retry after waiting for 
> some time. The default value for the config 
> templeton.job.submit.exec.max-procs is set to ‘0’. This means that by default 
> job submission requests are always accepted. The throttling behavior needs to 
> be enabled based on requirements.
> We can have similar behavior for Status and List operations with configs 
> templeton.job.status.exec.max-procs and templeton.list.job.exec.max-procs 
> respectively.
> Once a job operation has started, it can take a long time. The 
> client that requested the job operation may not be willing to wait 
> indefinitely. This work introduces the configurations
> templeton.exec.job.submit.timeout
> templeton.exec.job.status.timeout
> templeton.exec.job.list.timeout
> to specify the maximum amount of time a job operation can execute. If a 
> timeout occurs, list and status job requests return to the client with the 
> message
> "List job request got timed out. Please retry the operation after waiting for 
> some time."
> If a submit job request times out then 
>       i) The job submit request thread which receives the timeout checks 
> whether a valid job id has been generated in the job request.
>       ii) If one has been generated, it issues a kill job request on the 
> cancel thread pool. It doesn't wait for the operation to complete and returns 
> to the client with a timeout message. 
> Side effects of enabling timeout for submit operations:
> 1) There is a possibility that the job is still active for some time when the 
> client gets the response, and a list operation from the client could 
> potentially show the newly created job before it gets killed.
> 2) Killing the job is best effort, with no guarantees. This means there is a 
> possibility of a duplicate job being created. One possible reason for this 
> could be a case where the job is created and the operation then times out, but 
> the kill request fails due to resource manager unavailability. When the 
> resource manager restarts, it will restart the job which got created.
> Fixing this scenario is not part of the scope of this JIRA. This 
> functionality should be enabled only if the above side effects are acceptable.
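
The throttling idea in the quoted description (reject a request with a 503-style error once max-procs operations are already in flight, and accept everything when the limit is 0) can be illustrated with a small sketch. This is not the code from HIVE-15947.4.patch: the class names ThrottleSketch and TooManyRequestsException are hypothetical, the exception merely stands in for the HTTP 503 response, and a plain java.util.concurrent.Semaphore is assumed as the counting mechanism.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/** Hypothetical illustration only; not taken from the attached patch. */
public class ThrottleSketch {
    /** Stand-in for the HTTP 503 response described in the issue. */
    static class TooManyRequestsException extends Exception {
        TooManyRequestsException(String message) {
            super(message);
        }
    }

    // null means "no limit", mirroring the default value 0 in the description.
    private final Semaphore permits;

    /** maxProcs plays the role of templeton.job.submit.exec.max-procs. */
    ThrottleSketch(int maxProcs) {
        this.permits = maxProcs > 0 ? new Semaphore(maxProcs) : null;
    }

    /** Runs the operation if a slot is free, otherwise fails immediately. */
    <T> T execute(Callable<T> operation) throws Exception {
        if (permits == null) {
            return operation.call(); // throttling disabled
        }
        if (!permits.tryAcquire()) { // non-blocking: reject instead of queueing
            throw new TooManyRequestsException(
                "Too many concurrent job submission requests received. "
                + "Please wait for some time before retrying.");
        }
        try {
            return operation.call();
        } finally {
            permits.release(); // free the slot even if the operation failed
        }
    }

    public static void main(String[] args) throws Exception {
        ThrottleSketch throttle = new ThrottleSketch(1);
        // The outer call holds the only slot, so the nested call is rejected.
        String result = throttle.execute(() -> {
            try {
                throttle.execute(() -> "inner");
                return "inner accepted";
            } catch (TooManyRequestsException e) {
                return "inner rejected";
            }
        });
        System.out.println(result); // prints "inner rejected"
    }
}
```

Rejecting immediately with tryAcquire, rather than blocking, is what gives the client a predictable response time; the client is then responsible for backing off and retrying.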


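The submit-timeout handling in the quoted description (steps i and ii: on timeout, check whether a job id was generated and, if so, issue a best-effort kill on a separate cancel thread pool without waiting for it) could look roughly like the sketch below. It is an illustration under assumptions, not the patch itself: SubmitTimeoutSketch, SubmitOperation, killJob and the timeout message text are all hypothetical, and the timeout argument stands in for templeton.exec.job.submit.timeout.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical illustration only; not taken from the attached patch. */
public class SubmitTimeoutSketch {
    /** A submission that publishes the job id as soon as one exists. */
    interface SubmitOperation {
        String submit(AtomicReference<String> jobIdHolder) throws Exception;
    }

    private final ExecutorService workers = Executors.newCachedThreadPool();
    private final ExecutorService cancelPool = Executors.newSingleThreadExecutor();
    private final Consumer<String> killJob; // assumed hook that kills a job by id

    SubmitTimeoutSketch(Consumer<String> killJob) {
        this.killJob = killJob;
    }

    String submitWithTimeout(SubmitOperation op, long timeoutMillis) throws Exception {
        AtomicReference<String> jobId = new AtomicReference<>();
        Callable<String> task = () -> op.submit(jobId);
        Future<String> future = workers.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the submit thread
            String id = jobId.get();
            if (id != null) {
                // Step ii: best-effort kill on the cancel pool; do not wait for it.
                cancelPool.execute(() -> killJob.accept(id));
            }
            throw new TimeoutException("Job submit request timed out (placeholder message)");
        }
    }

    void shutdown() {
        workers.shutdownNow();
        cancelPool.shutdown(); // lets an already queued kill request finish
    }

    public static void main(String[] args) throws Exception {
        List<String> killed = new CopyOnWriteArrayList<>();
        CountDownLatch killIssued = new CountDownLatch(1);
        SubmitTimeoutSketch sketch = new SubmitTimeoutSketch(id -> {
            killed.add(id);
            killIssued.countDown();
        });
        try {
            sketch.submitWithTimeout(holder -> {
                holder.set("job_0001"); // the job id now exists
                Thread.sleep(5_000);    // simulate slow HDFS/YARN calls
                return holder.get();
            }, 200);
        } catch (TimeoutException e) {
            System.out.println("client sees: " + e.getMessage());
        }
        killIssued.await(2, TimeUnit.SECONDS);
        System.out.println("kill requested for: " + killed);
        sketch.shutdown();
    }
}
```

Because the kill runs asynchronously and its outcome is never checked, the sketch has the same side effects the description lists: the job can remain visible for a while after the client gets the timeout, and a failed kill leaves the job running.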

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
