[
https://issues.apache.org/jira/browse/IMPALA-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17316832#comment-17316832
]
wesleydeng commented on IMPALA-10578:
-------------------------------------
[~stigahuang] I found the reason! I had put the scratch dir (spill to disk)
on the same disk as the impalad log dir.
When a big query spills data to disk, small queries block while flushing their
logs to that same disk.
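A quick way to check for this kind of overlap is to compare the device IDs of
the two directories. The sketch below is illustrative only: same_device is a
hypothetical helper, not part of Impala, and st_dev identifies a filesystem
rather than a physical disk, so two partitions on one disk would still report
different devices.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths are on the same filesystem device.

    Compares st_dev from os.stat(); this detects a shared filesystem,
    not necessarily a shared physical disk.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

For example, same_device("/data1/impala/scratch", "/var/log/impalad") would
return True if both (hypothetical) paths sit on one mounted filesystem.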
Below are the thread stacks I captured with pstack.
Thread 2 (Thread 0x7fb0ed9f5700 (LWP 28910)):
#0 write () from /lib64/libc.so.6
#1 _IO_new_file_write () from /lib64/libc.so.6
#2 __GI__IO_do_write () from /lib64/libc.so.6
#3 __GI__IO_file_xsputn () from /lib64/libc.so.6
#4 fwrite () from /lib64/libc.so.6
#5 google::(anonymous namespace)::LogFileObject::Write(bool, long, char const*, int) ()
#6 google::LogMessage::SendToLog() ()
#7 google::LogMessage::Flush() ()
#8 google::LogMessage::~LogMessage() ()
#9 impala::KrpcDataStreamRecvr::Close (this=0x7fb36061f9c0) at /opt/Impala/be/src/runtime/krpc-data-stream-recvr.cc:829
Thread 407 (Thread 0x7fd22cfe0700 (LWP 5309)):
#0 0x00007fd2cd2213ae in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1 0x0000000004e9b514 in google::LogMessage::Flush() ()
#2 0x0000000004e9b719 in google::LogMessage::~LogMessage() ()
#3 0x00000000022fdbc4 in impala::ControlService::ExecQueryFInstances (this=0x195156d0, request=0x1bd67740, response=0x19396300, rpc_context=0x20845780) at /opt/Impala/be/src/service/control-service.cc:152
and many other places where a log flush happens:
google::LogMessage::~LogMessage() ()
impala::KrpcDataStreamRecvr::SenderQueue::GetBatch (this=0x231099c0, next_batch=0x1cd60ea0) at /opt/Impala/be/src/runtime/krpc-data-stream-recvr.cc:258
google::LogMessage::~LogMessage() ()
impala::KrpcDataStreamMgr::Maintenance (this=0x17ae2d80) at /opt/Impala/be/src/runtime/krpc-data-stream-mgr.cc:422
google::LogMessage::~LogMessage() ()
impala::QueryState::ReportExecStatus (this=0x127723e00) at /opt/Impala/be/src/runtime/query-state.cc:501
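
The pattern in these stacks (one thread inside write() while holding the
logging lock, another waiting in pthread_rwlock_wrlock) can be reproduced in
miniature. The following is only a simulation in Python, not glog's or
Impala's code: log_lock, log_line and simulate are hypothetical names, and
the slow disk write is faked with a sleep.

```python
import threading
import time

# Stands in for the lock that serializes log writes (assumption for this sketch).
log_lock = threading.Lock()


def log_line(write_seconds: float) -> float:
    """Write one 'log line' under the shared lock.

    Returns the number of seconds spent waiting to acquire the lock.
    """
    start = time.monotonic()
    with log_lock:
        waited = time.monotonic() - start
        time.sleep(write_seconds)  # simulated write() to the log file
    return waited


def simulate(slow_write: float = 0.3) -> float:
    """Return how long a fast logger waits behind one slow log write."""
    holding = threading.Event()

    def slow_logger() -> None:
        with log_lock:
            holding.set()           # signal that the lock is now held
            time.sleep(slow_write)  # write() stalled by a busy disk

    worker = threading.Thread(target=slow_logger)
    worker.start()
    holding.wait()            # make sure the slow writer owns the lock first
    waited = log_line(0.0)    # the "small query" thread logs one line
    worker.join()
    return waited
```

Calling simulate(0.3) returns roughly the duration of the slow write, even
though the second logger's own write takes no time at all, which matches the
idea that a small query stalls for as long as the log write is stuck behind
spill I/O.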
> Big Query influence other query seriously when hardware not reach limit
> ------------------------------------------------------------------------
>
> Key: IMPALA-10578
> URL: https://issues.apache.org/jira/browse/IMPALA-10578
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 3.4.0
> Environment: impala-3.4
> 80 machines with 96 cpu and 256GB mem
> scratch-dir is on separate disk different from HDFS data dir
> Reporter: wesleydeng
> Priority: Major
> Attachments: big_query.txt.bz2, image-2021-03-10-19-59-24-188.png,
> image-2021-03-16-16-32-37-862.png, small_query_be_influenced_very_slow.txt.bz2
>
>
> When a big query is running (using mt_dop=8), other queries have great
> difficulty starting.
> A small query (select distinct on one field from a small table) may take about
> 1 minute; normally it takes only about 1 to 3 seconds.
> In the impalad log, I found puzzling entries like these:
> !image-2021-03-16-16-32-37-862.png|width=836,height=189!
> !image-2021-03-10-19-59-24-188.png|width=892,height=435!
> ---------------
> About the gap between "Handling call" and "Deserializing Batch", I found
> another path:
> --KrpcDataStreamRecvr::SenderQueue::AddBatch
> ----EnqueueDeferredRpc(move(payload), l); // after dequeue, will call KrpcDataStreamRecvr::SenderQueue::AddBatchWork
> ---------------
>
>
> While the big query is running, data is spilled to disk because mem_limit
> was set and this big query uses a lot of memory.
>
> In the attachments, I have included the profiles of the big query and the
> small query. The small query normally finishes in seconds. The timeline of
> the small query is shown below:
> Query Timeline: 21m39s
> - Query submitted: 48.846us (48.846us)
> - Planning finished: 2.934ms (2.886ms)
> - Submit for admission: 12.572ms (9.637ms)
> - Completed admission: 13.622ms (1.050ms)
> - Ready to start on 56 backends: 15.271ms (1.649ms)
> - All 56 execution backends (171 fragment instances) started: 18s505ms (18s489ms)
> - Rows available: 51s770ms (33s265ms)
> - First row fetched: 57s220ms (5s449ms)
> - Last row fetched: 59s119ms (1s899ms)
> - Released admission control resources: 1m1s (2s223ms)
> - AdmissionControlTimeSinceLastUpdate: 80.000ms
> - ComputeScanRangeAssignmentTimer: 439.749us
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)