[ https://issues.apache.org/jira/browse/IMPALA-7014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472420#comment-16472420 ]
Tim Armstrong commented on IMPALA-7014:
---------------------------------------

I don't think we can justify spending 75ms of CPU time on each one of these, even if they are occasionally useful. I think the reasonable options here are:
* Disable the slow symbolisation.
* Disable the slow symbolisation, invest some engineering effort into making symbolisation fast, then turn on the fast symbolisation.

Which is better depends on whether it is more impactful than other things we could do to improve supportability. We've definitely wasted far more time chasing down performance problems caused by this than we've gained in benefit from it.

What I've generally found is that the backtrace itself is at best marginally useful, because these are all application-level errors that should include all of the relevant information in the error message itself (or in a log message). E.g. in the case above there's really only one code path that produces that error, and the interesting thing is which file had the problem. There are definitely cases where a backtrace is a useful backup for figuring out where an uninformative error message came from, but we should be fixing the error messages in those cases anyway. Contrast this with Java, where stacktraces are tremendously useful for runtime errors like NullPointerException, which aren't debuggable without knowing where in the code they came from.

> Disable stacktrace symbolisation by default
> -------------------------------------------
>
>                 Key: IMPALA-7014
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7014
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Not Applicable
>            Reporter: Tim Armstrong
>            Assignee: Joe McDonnell
>            Priority: Critical
>
> We got burned by the cost of producing stacktraces again with IMPALA-6996. I
> did a quick investigation into this, based on the hypothesis that the
> symbolisation was the expensive part, rather than getting the addresses.
> I added a stopwatch to GetStackTrace() to measure the time in nanoseconds and
> ran a test that produces a backtrace.
>
> The first experiment was:
> {noformat}
> $ start-impala-cluster.py --impalad_args='--symbolize_stacktrace=true' && impala-py.test tests/query_test/test_scanners.py -k codec
> I0511 09:45:11.897944 30904 debug-util.cc:283] stacktrace time: 75175573
> I0511 09:45:11.897956 30904 status.cc:125] File 'hdfs://localhost:20500/test-warehouse/test_bad_compression_codec_308108.db/bad_codec/bad_codec.parquet' uses an unsupported compression: 5000 for column 'id'.
>     @ 0x18782ef  impala::Status::Status()
>     @ 0x2cbe96f  impala::ParquetMetadataUtils::ValidateRowGroupColumn()
>     @ 0x205f597  impala::BaseScalarColumnReader::Reset()
>     @ 0x1feebe6  impala::HdfsParquetScanner::InitScalarColumns()
>     @ 0x1fe6ff3  impala::HdfsParquetScanner::NextRowGroup()
>     @ 0x1fe58d8  impala::HdfsParquetScanner::GetNextInternal()
>     @ 0x1fe3eea  impala::HdfsParquetScanner::ProcessSplit()
>     @ 0x1f6ba36  impala::HdfsScanNode::ProcessSplit()
>     @ 0x1f6adc4  impala::HdfsScanNode::ScannerThread()
>     @ 0x1f6a1c4  _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_18ThreadResourcePoolEENKUlvE_clEv
>     @ 0x1f6c2a6  _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_18ThreadResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
>     @ 0x1bd3b1a  boost::function0<>::operator()()
>     @ 0x1ebecd5  impala::Thread::SuperviseThread()
>     @ 0x1ec6e71  boost::_bi::list5<>::operator()<>()
>     @ 0x1ec6d95  boost::_bi::bind_t<>::operator()()
>     @ 0x1ec6d58  boost::detail::thread_data<>::run()
>     @ 0x31b3ada  thread_proxy
>     @ 0x7f9be67d36ba  start_thread
>     @ 0x7f9be650941d  clone
> {noformat}
> The stacktrace took 75ms, which is pretty bad! It would be worse on a
> production system with more memory maps.
> The next experiment was to disable it:
> {noformat}
> start-impala-cluster.py --impalad_args='--symbolize_stacktrace=false' && impala-py.test tests/query_test/test_scanners.py -k codec
> I0511 09:43:47.574185 29514 debug-util.cc:283] stacktrace time: 29528
> I0511 09:43:47.574193 29514 status.cc:125] File 'hdfs://localhost:20500/test-warehouse/test_bad_compression_codec_cb5d0225.db/bad_codec/bad_codec.parquet' uses an unsupported compression: 5000 for column 'id'.
>     @ 0x18782ef
>     @ 0x2cbe96f
>     @ 0x205f597
>     @ 0x1feebe6
>     @ 0x1fe6ff3
>     @ 0x1fe58d8
>     @ 0x1fe3eea
>     @ 0x1f6ba36
>     @ 0x1f6adc4
>     @ 0x1f6a1c4
>     @ 0x1f6c2a6
>     @ 0x1bd3b1a
>     @ 0x1ebecd5
>     @ 0x1ec6e71
>     @ 0x1ec6d95
>     @ 0x1ec6d58
>     @ 0x31b3ada
>     @ 0x7fbdcbdef6ba
>     @ 0x7fbdcbb2541d
> {noformat}
> That's 2545x faster! If the addresses are in the statically linked binary, we
> can use addr2line to get back the line numbers:
> {noformat}
> $ addr2line -e be/build/latest/service/impalad 0x2cbe96f
> /home/tarmstrong/Impala/incubator-impala/be/src/exec/parquet-metadata-utils.cc:166
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org