[ 
https://issues.apache.org/jira/browse/IMPALA-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Maheshwari updated IMPALA-3289:
--------------------------------------
    Description: 
When disk performance degrades drastically during query execution, Impala 
does not recognize this, and the query appears to "hang". A threshold could 
be set for disk I/O performance: once a node falls below it, no more 
fragments should be allocated to that node, and the node should be marked as 
degraded and removed from the executor list for a defined period, after which 
it could be retried if it has recovered.
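
The proposed behaviour could be sketched roughly as follows. This is only an 
illustration of the idea, not Impala code: the class ExecutorHealthTracker, 
its method names, and the threshold and quarantine values are all 
hypothetical, and throughput in MB/s is just one assumed way of measuring 
"disk I/O performance".

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExecutorHealthTracker:
    """Hypothetical tracker for degraded executors; not part of Impala."""

    min_throughput_mb_s: float  # assumed threshold below which a node is degraded
    quarantine_secs: float      # assumed time a degraded node stays excluded
    clock: Callable[[], float] = time.monotonic
    _excluded_until: Dict[str, float] = field(default_factory=dict, init=False)

    def report_throughput(self, node: str, mb_per_sec: float) -> None:
        """Record a disk throughput sample; quarantine the node if it is too slow."""
        if mb_per_sec < self.min_throughput_mb_s:
            self._excluded_until[node] = self.clock() + self.quarantine_secs

    def is_schedulable(self, node: str) -> bool:
        """True if the node was never quarantined or its quarantine has expired."""
        deadline = self._excluded_until.get(node)
        if deadline is None:
            return True
        if self.clock() >= deadline:
            # Quarantine expired: retry the node. Another slow sample
            # would put it back into quarantine.
            del self._excluded_until[node]
            return True
        return False

    def schedulable(self, nodes: List[str]) -> List[str]:
        """Filter the executor list down to nodes that may receive fragments."""
        return [n for n in nodes if self.is_schedulable(n)]


# Usage with a fake clock so the example is deterministic.
now = [0.0]
tracker = ExecutorHealthTracker(
    min_throughput_mb_s=20.0, quarantine_secs=300.0, clock=lambda: now[0]
)
tracker.report_throughput("node-a", 150.0)  # healthy sample
tracker.report_throughput("node-b", 2.0)    # degraded sample
print(tracker.schedulable(["node-a", "node-b"]))  # ['node-a']

now[0] = 301.0  # quarantine period has passed
print(tracker.schedulable(["node-a", "node-b"]))  # ['node-a', 'node-b']
```

In this sketch a node is re-admitted purely on a timer. A real implementation 
would probably also need a way to probe the node before sending it new 
fragments, and to handle fragments already running on it.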

Some error messages - 
{code:java}
E0226 07:27:41.546187 14795 tmp-file-mgr.cc:211] 
0541dda3dc371844:c22b251700000000] Error for temporary file 
'/data2/impala/impalad/impala-scratch/0541dda3dc371844:c22b251700000000_6b169f50-3a00-4ce6-a19e-fe9360aaed87':
 Disk I/O error on gbrpsr000012838:22000: open() failed for 
/data2/impala/impalad/impala-scratch/0541dda3dc371844:c22b251700000000_6b169f50-3a00-4ce6-a19e-fe9360aaed87.
 Disk level I/O error occured. errno=5

W1028 21:00:05.312568 56851 DfsClientShmManager.java:365] 
EndpointShmManager(DatanodeInfoWithStorage[22.50.92.142:1004,DS-4af8e8f7-c6b6-43e7-8a0a-19d445a7a32e,DISK],
 parent=ShortCircuitShmManager(2301f5f2)): error shutting down shm: got 
IOException calling shutdown(SHUT_RDWR) 

impalad.WARNING:W0226 15:15:45.458577 25224 BlockReaderFactory.java:647] 
0d43c912dd091557:ab21fb05000000f5] 
BlockReaderFactory(fileName=/warehouse/datalake/AAAAAAAAA.dat, 
block=BP-1018268685-35.49.40.158-1438950312819:blk_5003986013_3938223123): 
unknown response code ERROR while attempting to set up short-circuit access. 
RegisteredShm(62d9cfb1e2af3c6697ace97f93109c88): slot 125 is already in use.. 
Short-circuit read for DataNode 
DatanodeInfoWithStorage[22.50.92.142:1004,DS-ce9c7134-ad13-47fc-93c0-8cec6c3f3e7e,DISK]
 is disabled temporarily for 1 seconds based on 
dfs.domain.socket.disable.interval.seconds.{code}
Ref - 
[https://github.com/apache/impala/blob/7f190c4625f26cb375c0b0fa504ecb0887a70048/be/src/runtime/io/disk-io-mgr-test.cc#L556]

  was:
When disk performance degrades drastically during query execution, Impala 
does not recognize this, and the query appears to "hang". A threshold could 
be set for disk I/O performance: once a node falls below it, no more 
fragments should be allocated to that node, and the node should be marked as 
degraded and removed from the executor list for a defined period, after which 
it could be retried if it has recovered.

Some error messages - 
{code:java}
E0226 07:27:41.546187 14795 tmp-file-mgr.cc:211] 
0541dda3dc371844:c22b251700000000] Error for temporary file 
'/data2/impala/impalad/impala-scratch/0541dda3dc371844:c22b251700000000_6b169f50-3a00-4ce6-a19e-fe9360aaed87':
 Disk I/O error on gbrpsr000012838.intranet.barcapint.com:22000: open() failed 
for 
/data2/impala/impalad/impala-scratch/0541dda3dc371844:c22b251700000000_6b169f50-3a00-4ce6-a19e-fe9360aaed87.
 Disk level I/O error occured. errno=5

W1028 21:00:05.312568 56851 DfsClientShmManager.java:365] 
EndpointShmManager(DatanodeInfoWithStorage[22.50.92.142:1004,DS-4af8e8f7-c6b6-43e7-8a0a-19d445a7a32e,DISK],
 parent=ShortCircuitShmManager(2301f5f2)): error shutting down shm: got 
IOException calling shutdown(SHUT_RDWR) 

impalad.WARNING:W0226 15:15:45.458577 25224 BlockReaderFactory.java:647] 
0d43c912dd091557:ab21fb05000000f5] 
BlockReaderFactory(fileName=/warehouse/datalake/AAAAAAAAA.dat, 
block=BP-1018268685-35.49.40.158-1438950312819:blk_5003986013_3938223123): 
unknown response code ERROR while attempting to set up short-circuit access. 
RegisteredShm(62d9cfb1e2af3c6697ace97f93109c88): slot 125 is already in use.. 
Short-circuit read for DataNode 
DatanodeInfoWithStorage[22.50.92.142:1004,DS-ce9c7134-ad13-47fc-93c0-8cec6c3f3e7e,DISK]
 is disabled temporarily for 1 seconds based on 
dfs.domain.socket.disable.interval.seconds.{code}
Ref - 
https://github.com/apache/impala/blob/7f190c4625f26cb375c0b0fa504ecb0887a70048/be/src/runtime/io/disk-io-mgr-test.cc#L556


> Disk performance threshold to avoid "hang"
> ------------------------------------------
>
>                 Key: IMPALA-3289
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3289
>             Project: IMPALA
>          Issue Type: New Feature
>          Components: Backend
>    Affects Versions: Impala 2.3.0
>            Reporter: Thomas Scott
>            Priority: Minor
>              Labels: resource-management
>
> When disk performance degrades drastically during query execution, Impala 
> does not recognize this, and the query appears to "hang". A threshold could 
> be set for disk I/O performance: once a node falls below it, no more 
> fragments should be allocated to that node, and the node should be marked 
> as degraded and removed from the executor list for a defined period, after 
> which it could be retried if it has recovered.
> Some error messages - 
> {code:java}
> E0226 07:27:41.546187 14795 tmp-file-mgr.cc:211] 
> 0541dda3dc371844:c22b251700000000] Error for temporary file 
> '/data2/impala/impalad/impala-scratch/0541dda3dc371844:c22b251700000000_6b169f50-3a00-4ce6-a19e-fe9360aaed87':
>  Disk I/O error on gbrpsr000012838:22000: open() failed for 
> /data2/impala/impalad/impala-scratch/0541dda3dc371844:c22b251700000000_6b169f50-3a00-4ce6-a19e-fe9360aaed87.
>  Disk level I/O error occured. errno=5
> W1028 21:00:05.312568 56851 DfsClientShmManager.java:365] 
> EndpointShmManager(DatanodeInfoWithStorage[22.50.92.142:1004,DS-4af8e8f7-c6b6-43e7-8a0a-19d445a7a32e,DISK],
>  parent=ShortCircuitShmManager(2301f5f2)): error shutting down shm: got 
> IOException calling shutdown(SHUT_RDWR) 
> impalad.WARNING:W0226 15:15:45.458577 25224 BlockReaderFactory.java:647] 
> 0d43c912dd091557:ab21fb05000000f5] 
> BlockReaderFactory(fileName=/warehouse/datalake/AAAAAAAAA.dat, 
> block=BP-1018268685-35.49.40.158-1438950312819:blk_5003986013_3938223123): 
> unknown response code ERROR while attempting to set up short-circuit access. 
> RegisteredShm(62d9cfb1e2af3c6697ace97f93109c88): slot 125 is already in use.. 
> Short-circuit read for DataNode 
> DatanodeInfoWithStorage[22.50.92.142:1004,DS-ce9c7134-ad13-47fc-93c0-8cec6c3f3e7e,DISK]
>  is disabled temporarily for 1 seconds based on 
> dfs.domain.socket.disable.interval.seconds.{code}
> Ref - 
> [https://github.com/apache/impala/blob/7f190c4625f26cb375c0b0fa504ecb0887a70048/be/src/runtime/io/disk-io-mgr-test.cc#L556]


