[ https://issues.apache.org/jira/browse/IMPALA-9819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Armstrong updated IMPALA-9819:
----------------------------------
Component/s: Backend
> Separate data cache and HDFS scan node runtime profile metrics
> --------------------------------------------------------------
>
> Key: IMPALA-9819
> URL: https://issues.apache.org/jira/browse/IMPALA-9819
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Sahil Takiar
> Assignee: Joe McDonnell
> Priority: Major
>
> When a query reads data from both a remote storage system (e.g. S3) and the
> data cache, the HDFS_SCAN_NODE runtime profiles are hard to reason about.
> For example, consider the following runtime profile snippet:
> {code:java}
> HDFS_SCAN_NODE (id=0):(Total: 59s374ms, non-child: 0.000ns, % non-child: 0.00%)
> - AverageHdfsReadThreadConcurrency: 0.62
> - AverageScannerThreadConcurrency: 0.91
> - BytesRead: 587.97 MB (616533483)
> - BytesReadDataNodeCache: 0
> - BytesReadLocal: 0
> - BytesReadRemoteUnexpected: 0
> - BytesReadShortCircuit: 0
> - CachedFileHandlesHitCount: 323 (323)
> - CachedFileHandlesMissCount: 94 (94)
> - CollectionItemsRead: 0 (0)
> - DataCacheHitBytes: 212.00 MB (222294996)
> - DataCacheHitCount: 107 (107)
> - DataCacheMissBytes: 375.98 MB (394238486)
> - DataCacheMissCount: 310 (310)
> - DataCachePartialHitCount: 0 (0)
> - DecompressionTime: 2s428ms
> - MaterializeTupleTime: 19s444ms
> - MaxCompressedTextFileLength: 0
> - NumColumns: 3 (3)
> - NumDictFilteredRowGroups: 0 (0)
> - NumDisksAccessed: 1 (1)
> - NumPages: 53.30K (53300)
> - NumRowGroups: 83 (83)
> - NumRowGroupsWithPageIndex: 83 (83)
> - NumScannerThreadMemUnavailable: 0 (0)
> - NumScannerThreadReservationsDenied: 0 (0)
> - NumScannerThreadsStarted: 1 (1)
> - NumScannersWithNoReads: 0 (0)
> - NumStatsFilteredPages: 0 (0)
> - NumStatsFilteredRowGroups: 0 (0)
> - PeakMemoryUsage: 16.00 MB (16781312)
> - PeakScannerThreadConcurrency: 1 (1)
> - PerReadThreadRawHdfsThroughput: 15.11 MB/sec
> - RemoteScanRanges: 0 (0)
> - RowBatchBytesEnqueued: 670.68 MB (703260541)
> - RowBatchQueueGetWaitTime: 59s368ms
> - RowBatchQueuePeakMemoryUsage: 4.17 MB (4368285)
> - RowBatchQueuePutWaitTime: 0.000ns
> - RowBatchesEnqueued: 915 (915)
> - RowsRead: 413.47M (413466507)
> - RowsReturned: 722.27K (722275)
> - RowsReturnedRate: 12.17 K/sec
> - ScanRangesComplete: 83 (83)
> - ScannerIoWaitTime: 33s454ms
> - ScannerThreadWorklessLoops: 0 (0)
> - ScannerThreadsInvoluntaryContextSwitches: 1.94K (1940)
> - ScannerThreadsTotalWallClockTime: 1m
> - ScannerThreadsSysTime: 1s181ms
> - ScannerThreadsUserTime: 20s581ms
> - ScannerThreadsVoluntaryContextSwitches: 770 (770)
> - TotalRawHdfsOpenFileTime: 3s396ms
> - TotalRawHdfsReadTime: 38s940ms
> - TotalReadThroughput: 8.86 MB/sec
> {code}
> The query scanned part of its data from S3 and the rest from the data
> cache. The confusing part is that metrics such as
> PerReadThreadRawHdfsThroughput are measured across both S3 and data cache
> reads, so there is no straightforward way to determine the throughput of
> *just* the S3 reads. Users might want that value to determine whether S3
> was particularly slow for their query.
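> For illustration, the reported aggregate appears consistent with
> BytesRead / TotalRawHdfsReadTime (587.97 MB / 38.94s ~= 15.1 MB/sec), but
> because the denominator mixes cache and S3 read time, an S3-only rate
> cannot be recovered from these counters. A minimal sketch of the
> arithmetic (the variable names below are hypothetical, not actual Impala
> counters):
> {code:cpp}
> #include <cstdio>
>
> int main() {
>   // Values taken from the profile snippet above.
>   double bytes_read_mb = 587.97;   // BytesRead (cache + S3 combined)
>   double raw_read_time_s = 38.94;  // TotalRawHdfsReadTime (cache + S3 combined)
>   double s3_bytes_mb = 375.98;     // DataCacheMissBytes, i.e. bytes served by S3
>
>   // The published counter blends both sources:
>   printf("PerReadThreadRawHdfsThroughput ~= %.2f MB/sec\n",
>          bytes_read_mb / raw_read_time_s);  // ~15.1, matching the profile
>
>   // An S3-only rate would need the time spent on S3 reads alone, e.g.
>   //   double s3_throughput = s3_bytes_mb / s3_read_time_s;
>   // but no counter tracks s3_read_time_s separately today.
>   return 0;
> }
> {code}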
> It would be nice if the scan node metrics more clearly differentiated
> between reads from S3 and reads from the data cache. The aggregate
> (*Total*) metrics are still worth keeping, but fine-grained metrics
> specific to each storage system (e.g. the data cache or S3) would make
> profiles like the one above much easier to interpret.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)