[
https://issues.apache.org/jira/browse/HIVE-15473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842353#comment-15842353
]
anishek edited comment on HIVE-15473 at 1/27/17 7:16 AM:
---------------------------------------------------------
There are a few observations / limitations that [~thejas] cited while
reviewing this. I am writing down the reasoning here, along with the steps for
how we can move forward.
Given that we use SynchronizedHandler for the client on the beeline side, only
one operation / API call at a time can be in execution from a single beeline
session to hiveserver2. The current flow for updating the progress bar on the
client side is:
Thread 1 -- does statement execution: This is achieved by calling
GetOperationStatus for the operation from beeline until the execution of the
operation is complete. The server-side implementation of GetOperationStatus
uses a timeout mechanism (it waits for the query execution to finish) before
it sends the status to the client. The timeout value is decided by a step
function; for long-running queries this can lead to a wait time of
approximately 5 seconds per call to GetOperationStatus.
Thread 2 -- prints query logs and progress logs.
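To make the serialization concrete, here is a minimal sketch of a client wrapped in a synchronizing proxy. It is not the actual Hive code: the `Client` interface, the method names on it, and the simulated server wait are all assumptions for illustration. It only shows that a log fetch issued while a status call is in flight has to wait for that call to return.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.CountDownLatch;

final class SynchronizedClientSketch {

    /** Hypothetical stand-in for the client API; not the real Hive interface. */
    interface Client {
        String getOperationStatus();
        String fetchLogs();
    }

    /** Wraps a client so that every call runs under one shared lock. */
    static Client synchronizedClient(Client delegate) {
        final Object lock = new Object();
        InvocationHandler handler = (proxy, method, args) -> {
            synchronized (lock) {
                try {
                    return method.invoke(delegate, args);
                } catch (InvocationTargetException e) {
                    throw e.getCause();
                }
            }
        };
        return (Client) Proxy.newProxyInstance(
                Client.class.getClassLoader(), new Class<?>[] {Client.class}, handler);
    }

    /** Returns how long (ms) a log fetch waits behind an in-flight status call. */
    static long logFetchDelayMillis(long serverWaitMillis) {
        CountDownLatch statusInFlight = new CountDownLatch(1);
        Client raw = new Client() {
            @Override public String getOperationStatus() {
                statusInFlight.countDown();      // runs while the proxy lock is held
                sleepQuietly(serverWaitMillis);  // simulated server-side wait
                return "RUNNING";
            }
            @Override public String fetchLogs() {
                return "log line";
            }
        };
        Client client = synchronizedClient(raw);
        Thread statusThread = new Thread(client::getOperationStatus); // "Thread 1"
        statusThread.start();
        try {
            statusInFlight.await();
            long start = System.nanoTime();
            client.fetchLogs();                  // "Thread 2": blocks until the lock is free
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            statusThread.join();
            return elapsedMillis;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("log fetch delayed by ~" + logFetchDelayMillis(500) + " ms");
    }
}
```

With a simulated server wait of 500 ms the log fetch is held up for roughly that long, which is the same effect as the approximately 5 second delay described above for long-running queries.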
*Problem Space:*
# Since the client synchronizes the various API calls, only one API call from
either Thread 1 or Thread 2 is executed at a time. Trying to project a
concurrent execution capability in the beeline code therefore seems misleading,
and with the current patch the progress bar / query log updates can be delayed
by 5 seconds or more ( _I don't think we can avoid this anyway, as I will
discuss later_ ).
# Additionally, since no *order* is maintained among threads requesting
synchronization on an object, there is a possibility that Thread 1 can get the
next lock on the object without Thread 2 getting a chance to obtain the lock,
thus leading to long delays in updating the query log or progress log ( _I am
not sure how this would happen for the use case of long-running queries, since
while Thread 1 is executing, Thread 2 would already be blocked on the
synchronization of the object. Once Thread 1 completes, and before it comes
around the while loop in_
{code}
HiveStatement.waitForOperationToComplete()
{code}
_Thread 2 should start executing. It seems highly improbable that Thread 1
completes, executes additional statements, and gets the lock again before
Thread 2 gets a chance to acquire the lock_ ).
So in summary:
* Avoid multi-threaded code in beeline for interactions with hiveserver2, as
no concurrency is supported by the Thrift protocol, unless we move to
ThriftHttpCliService using an HTTP-based connection, or use a NonBlockingThrift
server for the binary protocol on the server side.
* Address the issue of responsiveness if we can.
*Solution Space:*
Since concurrent execution is not supported, programming anything to that
effect should be avoided in the beeline client. Hence, we strive to remove the
multi-threaded code from the beeline side, in effect merging the query log and
progress bar log into the GetOperationStatus API. This would still not address
the issue of responsiveness indicated in 1. above, as GetOperationStatus will
use the wait time before responding to calls from the beeline side, unless we
decide to remove it or reduce the wait time to a default value of, say, 500
milliseconds. I am not sure why the step function is used -- _to prevent the
server from wasting CPU resources on non-critical operations?_ This will
address 2. above, though, since we are going to get all the information in a
single call.
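A rough sketch of what the single-threaded flow could look like once logs and progress ride along with the status response. The `StatusResponse` and `StatusCall` types and their fields are hypothetical, made up for this illustration; they are not the existing Thrift structures, and the scripted responses stand in for a real server.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class MergedStatusPollSketch {

    /** Hypothetical merged response: state, new log lines, and progress text. */
    static final class StatusResponse {
        final boolean finished;
        final List<String> newLogLines;
        final String progress;

        StatusResponse(boolean finished, List<String> newLogLines, String progress) {
            this.finished = finished;
            this.newLogLines = newLogLines;
            this.progress = progress;
        }
    }

    /** Hypothetical stand-in for one GetOperationStatus round trip. */
    interface StatusCall {
        StatusResponse poll();
    }

    /** One loop, one thread: each response carries everything to display. */
    static List<String> waitForOperationToComplete(StatusCall call) {
        List<String> rendered = new ArrayList<>();
        StatusResponse response;
        do {
            response = call.poll();
            rendered.addAll(response.newLogLines);
            if (response.progress != null) {
                rendered.add("[progress] " + response.progress);
            }
        } while (!response.finished);
        return rendered;
    }

    /** Runs the loop against two scripted responses instead of a server. */
    static List<String> demo() {
        Iterator<StatusResponse> scripted = Arrays.asList(
                new StatusResponse(false, Arrays.asList("Compiling query"), "Map 1: 0/4"),
                new StatusResponse(true, Arrays.asList("Query finished"), "Map 1: 4/4"))
                .iterator();
        return waitForOperationToComplete(scripted::next);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because there is only one caller, there is no lock to contend for, so the ordering concern in 2. disappears. The per-call server wait is untouched, so the delay in 1. remains.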
*Implementation Considerations:*
# Merge the QueryLog and ProgressBarLog request / response into
GetOperationStatus.
# To get this working we have to extend HiveStatement to include a few
non-JDBC-compliant setters (one interface for displaying the progress bar, the
other for displaying query logs) -- the default implementations for these will
be _do nothing_ implementations.
# Have setters on HiveStatement for both interfaces, used by beeline to
provide the required implementations.
# As part of the HiveStatement execute(*) call, we create the appropriate
request if custom implementations of the interfaces above are provided.
# We might need to create an additional function signature for
GetOperationStatus for backward compatibility.
# _Not related to above_ : make sure we pass the vertex progress as a string
(for progress bar display) and the query progress as a custom enum for decision
making (with implementations on the server side mapping from the execution
engine based state to our generic enum state).
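Items 2 to 4 and 6 could be sketched roughly as below. Every name here (`QueryLogHandler`, `ProgressHandler`, `SketchStatement`, `QueryState`, and the engine state strings) is a placeholder chosen for illustration, not the proposed HiveStatement API or a real execution engine's state set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StatementHandlersSketch {

    /** Hypothetical callback for query log lines; default does nothing. */
    interface QueryLogHandler {
        QueryLogHandler NO_OP = line -> { };
        void onLogLine(String line);
    }

    /** Hypothetical callback for progress text; default does nothing. */
    interface ProgressHandler {
        ProgressHandler NO_OP = text -> { };
        void onProgress(String vertexProgress);
    }

    /** Generic, engine-independent query state (item 6). */
    enum QueryState { RUNNING, SUCCEEDED, FAILED }

    /** Maps an assumed engine-specific state string to the generic enum. */
    static QueryState fromEngineState(String engineState) {
        switch (engineState) {
            case "SUCCEEDED": return QueryState.SUCCEEDED;
            case "FAILED":
            case "KILLED":    return QueryState.FAILED;
            default:          return QueryState.RUNNING;
        }
    }

    /** Stand-in for a statement with the two non-JDBC setters. */
    static final class SketchStatement {
        private QueryLogHandler logHandler = QueryLogHandler.NO_OP;
        private ProgressHandler progressHandler = ProgressHandler.NO_OP;

        void setQueryLogHandler(QueryLogHandler handler) {
            this.logHandler = handler == null ? QueryLogHandler.NO_OP : handler;
        }

        void setProgressHandler(ProgressHandler handler) {
            this.progressHandler = handler == null ? ProgressHandler.NO_OP : handler;
        }

        /** Only ask the server for logs if a custom handler was supplied. */
        boolean requestsLogs() {
            return logHandler != QueryLogHandler.NO_OP;
        }

        /** Only ask the server for progress if a custom handler was supplied. */
        boolean requestsProgress() {
            return progressHandler != ProgressHandler.NO_OP;
        }

        /** Hands the contents of one merged status response to the handlers. */
        void deliver(List<String> logLines, String vertexProgress) {
            logLines.forEach(logHandler::onLogLine);
            if (vertexProgress != null) {
                progressHandler.onProgress(vertexProgress);
            }
        }
    }

    public static void main(String[] args) {
        SketchStatement statement = new SketchStatement();
        List<String> printed = new ArrayList<>();
        statement.setQueryLogHandler(printed::add);
        statement.deliver(Arrays.asList("Compiling query"), "Map 1: 0/4");
        System.out.println(printed + " logs=" + statement.requestsLogs()
                + " progress=" + statement.requestsProgress());
    }
}
```

A plain JDBC caller that never sets a handler keeps the do-nothing defaults, so in this sketch nothing extra would be requested from the server on its behalf.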
If we are too worried about the responsiveness of the progress bar, or if *2.
in Problem Space* is a major impediment for hive usage, we should go with the
new implementation proposal; otherwise we just additionally implement *6. in
Implementation Considerations*.
> Progress Bar on Beeline client
> ------------------------------
>
> Key: HIVE-15473
> URL: https://issues.apache.org/jira/browse/HIVE-15473
> Project: Hive
> Issue Type: Improvement
> Components: Beeline, HiveServer2
> Affects Versions: 2.1.1
> Reporter: anishek
> Assignee: anishek
> Priority: Minor
> Attachments: HIVE-15473.2.patch, HIVE-15473.3.patch,
> HIVE-15473.4.patch, HIVE-15473.5.patch, screen_shot_beeline.jpg
>
>
> Hive CLI can show a progress bar for the tez execution engine, as shown in
> https://issues.apache.org/jira/secure/attachment/12678767/ux-demo.gif
> It would be great to have a similar progress bar displayed when the user is
> connecting via the beeline command line client as well.