[ 
https://issues.apache.org/jira/browse/HIVE-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18065574#comment-18065574
 ] 

Ramit Gupta commented on HIVE-29465:
------------------------------------

can I work on this one?

> Prevent excessive query results cache usage at runtime
> ------------------------------------------------------
>
>                 Key: HIVE-29465
>                 URL: https://issues.apache.org/jira/browse/HIVE-29465
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Priority: Major
>
> Currently, Hive cannot prevent excessive spilling: when the results cache is 
> enabled, the results dir is set to the cache folder, 
> [here|https://github.com/apache/hive/blob/399200af7cb11cf6ee3329ebdabe17792e5e7e85/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L7502-L7512],
>  so a given query's tasks write directly to this path. The only "easy" way to 
> enforce a limit is to check the size automatically at runtime (in the tasks), 
> fail the query, and re-run it with the query cache disabled.
> In the original issue scenario, queries kept spilling excessively to Amazon 
> EFS (see the 1T folders below), with no chance to intervene:
> {code}
> du -h -d 1 /efs/tmp/hive/_resultscache_/results-9d89cc59-c99d-46a5-9d93-2b5505765320
> 12.0K ./66356edb-57a6-4f0a-90cd-7d14d9e2b739
> 12.0K ./2db559fc-e32a-499b-86a7-a1247e9c972e
> 36.0K ./6fa8257f-6ef7-463f-8b5f-fb4d08ab0310
> 256.0K ./cdd64f99-20b8-4ef7-be03-3f638cc5be63
> 8.0K ./695f67da-db48-46be-90c9-58f6a796e5a7
> 12.0K ./c109c9d8-f436-46ea-9c5a-9f6afc3c8dcd
> 20.0K ./6eed7d64-f9d1-4516-a036-fdcdbabf65b5
> 72.0K ./aec5bb69-8163-4288-9ef3-bdaa74fe9349
> 12.0K ./353be5fe-1043-49c0-8712-7f42af2b1273
> 12.0K ./efb201cf-32b0-4fb1-9d43-ec566f27dae4
> 1.1T ./0fe343fb-6a89-4d28-b2fd-caed2f2e42f6
> 12.0K ./90bde86b-a72d-4d37-a117-2ada5701b806
> 12.0K ./808d348a-364a-4c4c-b63e-6da5f5b0cb89
> 12.0K ./4172f6b3-74b8-414e-b22b-fe061a569b08
> 1.1T .
> {code}
> This is most probably because *hive.query.results.cache.max.size* and 
> *hive.query.results.cache.max.entry.size* can only be enforced at the very 
> end, in 
> [setEntryValid|https://github.com/apache/hive/blob/399200af7cb11cf6ee3329ebdabe17792e5e7e85/ql/src/java/org/apache/hadoop/hive/ql/cache/results/QueryResultsCache.java#L516-L525],
>  note its javadoc: "Updates a pending cache entry with a FetchWork result 
> from a finished query".
> This can be a problem for large result sets: even if we accept such a result 
> size on a given filesystem (one that stores normal query results), we don't 
> necessarily accept the same size on a different filesystem designated for 
> query cache results.
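The runtime check proposed above could look roughly like the sketch below: while tasks write into the results-cache entry directory, periodically sum its size and fail the query once it exceeds the configured limit, so the query can be re-run with the cache disabled. The class and method names are hypothetical, and `java.nio` stands in for the Hadoop FileSystem API (a real implementation inside Hive would likely use `FileSystem.getContentSummary`):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Stream;

/**
 * Sketch of the runtime guard described in HIVE-29465. Periodically invoked
 * while tasks write to the results-cache directory; aborts once the entry
 * grows past hive.query.results.cache.max.entry.size, so the caller can
 * re-run the query with the results cache disabled. Names are hypothetical.
 */
public class ResultsCacheSizeGuard {
    private final Path resultsDir;
    private final long maxEntrySizeBytes;

    public ResultsCacheSizeGuard(Path resultsDir, long maxEntrySizeBytes) {
        this.resultsDir = resultsDir;
        this.maxEntrySizeBytes = maxEntrySizeBytes;
    }

    /** Current on-disk size of the cache entry directory, in bytes. */
    public long currentSize() throws IOException {
        AtomicLong total = new AtomicLong();
        try (Stream<Path> stream = Files.walk(resultsDir)) {
            stream.filter(Files::isRegularFile).forEach(p -> {
                try {
                    total.addAndGet(Files.size(p));
                } catch (IOException ignored) {
                    // a task may delete/rename files concurrently; skip them
                }
            });
        }
        return total.get();
    }

    /** Throws if the entry has grown past the configured limit. */
    public void check() throws IOException {
        long size = currentSize();
        if (size > maxEntrySizeBytes) {
            throw new IllegalStateException(
                "Results cache entry " + resultsDir + " is " + size
                + " bytes, exceeding the limit of " + maxEntrySizeBytes
                + "; failing the query so it can re-run without the cache");
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temp dir standing in for the _resultscache_ path.
        Path dir = Files.createTempDirectory("resultscache");
        Files.write(dir.resolve("part-0"), new byte[2048]);
        ResultsCacheSizeGuard guard = new ResultsCacheSizeGuard(dir, 1024);
        try {
            guard.check();
            System.out.println("within limit");
        } catch (IllegalStateException e) {
            System.out.println("limit exceeded");
        }
    }
}
```

Calling `check()` on a timer (or every N rows written) would bound how far past the limit a runaway query can spill, at the cost of repeated directory scans; the scan interval would be the main tuning knob.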



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
