[ 
https://issues.apache.org/jira/browse/HIVE-29465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-29465:
--------------------------------
    Description: 
Currently, Hive cannot prevent excessive spilling into the results cache: when 
the cache is enabled, the results directory is set to the cache folder, 
[here|https://github.com/apache/hive/blob/399200af7cb11cf6ee3329ebdabe17792e5e7e85/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L7502-L7512],
 so a given query's tasks write directly to this path. The only "easy" way to 
limit this is to check the written size automatically at runtime (in the tasks), 
fail the query, and re-run it with the query results cache disabled.
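The runtime check described above could be sketched as a byte-counting output stream wrapper that fails the write once a cap is exceeded. This is an illustrative sketch only, not Hive's actual API: the class name {{SizeCappedOutputStream}} and the cap parameter (standing in for *hive.query.results.cache.max.entry.size*) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Illustrative sketch: counts bytes written by a task and fails the write
// once a configured cap (hypothetically, the max entry size) is exceeded.
class SizeCappedOutputStream extends FilterOutputStream {
    private final long maxBytes;
    private long written;

    SizeCappedOutputStream(OutputStream out, long maxBytes) {
        super(out);
        this.maxBytes = maxBytes;
    }

    @Override
    public void write(int b) throws IOException {
        ensureCapacity(1);
        out.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        ensureCapacity(len);
        out.write(b, off, len);
    }

    // Throws before writing, so nothing beyond the cap reaches the cache dir.
    private void ensureCapacity(long more) throws IOException {
        written += more;
        if (written > maxBytes) {
            throw new IOException("results cache entry exceeded " + maxBytes + " bytes");
        }
    }
}

public class Demo {
    public static void main(String[] args) {
        SizeCappedOutputStream capped =
            new SizeCappedOutputStream(new ByteArrayOutputStream(), 8);
        try {
            capped.write(new byte[8]); // fits within the cap
            capped.write(new byte[1]); // exceeds the cap, so the task fails here
        } catch (IOException e) {
            System.out.println("failed as expected: " + e.getMessage());
        }
    }
}
```

Failing inside the write path would let the driver catch the error while the directory is still small, instead of discovering a 1.1T folder afterwards.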

The original issue scenario was that queries kept excessively spilling to 
Amazon EFS (note the 1.1T folder below), without any chance to intervene:
{code}
du -h -d 1 /efs/tmp/hive/_resultscache_/results-9d89cc59-c99d-46a5-9d93-2b5505765320
12.0K ./66356edb-57a6-4f0a-90cd-7d14d9e2b739
12.0K ./2db559fc-e32a-499b-86a7-a1247e9c972e
36.0K ./6fa8257f-6ef7-463f-8b5f-fb4d08ab0310
256.0K ./cdd64f99-20b8-4ef7-be03-3f638cc5be63
8.0K ./695f67da-db48-46be-90c9-58f6a796e5a7
12.0K ./c109c9d8-f436-46ea-9c5a-9f6afc3c8dcd
20.0K ./6eed7d64-f9d1-4516-a036-fdcdbabf65b5
72.0K ./aec5bb69-8163-4288-9ef3-bdaa74fe9349
12.0K ./353be5fe-1043-49c0-8712-7f42af2b1273
12.0K ./efb201cf-32b0-4fb1-9d43-ec566f27dae4
1.1T ./0fe343fb-6a89-4d28-b2fd-caed2f2e42f6
12.0K ./90bde86b-a72d-4d37-a117-2ada5701b806
12.0K ./808d348a-364a-4c4c-b63e-6da5f5b0cb89
12.0K ./4172f6b3-74b8-414e-b22b-fe061a569b08
1.1T .
{code}

This is most probably because *hive.query.results.cache.max.size* and 
*hive.query.results.cache.max.entry.size* can only be enforced at the very end, in 
[setEntryValid|https://github.com/apache/hive/blob/399200af7cb11cf6ee3329ebdabe17792e5e7e85/ql/src/java/org/apache/hadoop/hive/ql/cache/results/QueryResultsCache.java#L516-L525].

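The proposed fallback (fail the query when the runtime check trips, then re-run with the cache disabled) could be sketched as below. This is a hypothetical sketch, not Hive's driver code: {{CacheFallbackRunner}} and the two {{Callable}} parameters are made up for illustration; the second call stands in for re-running with *hive.query.results.cache.enabled=false*.

```java
import java.util.concurrent.Callable;

// Illustrative sketch: run the query with the results cache enabled; if the
// runtime size check fails the query, re-run it with the cache disabled.
class CacheFallbackRunner {
    static String run(Callable<String> withCache, Callable<String> withoutCache)
            throws Exception {
        try {
            // Tasks may fail once the cache entry grows past the runtime cap.
            return withCache.call();
        } catch (IllegalStateException cacheTooLarge) {
            // Equivalent of: set hive.query.results.cache.enabled=false; re-run.
            return withoutCache.call();
        }
    }
}

public class Demo2 {
    public static void main(String[] args) throws Exception {
        String result = CacheFallbackRunner.run(
            () -> { throw new IllegalStateException("cache entry too large"); },
            () -> "rows written to scratch dir");
        System.out.println(result);
    }
}
```

The retry is transparent to the client: the query still succeeds, just without populating the results cache.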
  was:Currently, Hive cannot prevent excessive spilling into the results cache: 
when the cache is enabled, the results directory is set to the cache folder, 
[here|https://github.com/apache/hive/blob/399200af7cb11cf6ee3329ebdabe17792e5e7e85/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L7502-L7512],
 so a given query's tasks write directly to this path. The only "easy" way to 
limit this is to check the written size automatically at runtime (in the tasks), 
fail the query, and re-run it with the query results cache disabled.


> Prevent excessive query results cache usage at runtime
> ------------------------------------------------------
>
>                 Key: HIVE-29465
>                 URL: https://issues.apache.org/jira/browse/HIVE-29465
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
