[
https://issues.apache.org/jira/browse/HIVE-27737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor updated HIVE-27737:
--------------------------------
Description:
[HIVE-17574|https://github.com/apache/hive/commit/26753ade2a130339940119c950b9c9af53e3d024]
introduced an optimization: HDFS-based resources can optionally be localized
directly from their "original" HDFS folder instead of a Tez session dir. This
reduces the HDFS overhead. The behavior is controlled by
hive.resource.use.hdfs.location, so there are 2 cases:
1. hive.resource.use.hdfs.location=true
a) collect "HDFS temp files" and optimize their access: added files, added jars
b) collect local temp files and use the non-optimized, session-based approach:
added files, added jars, aux jars, reloadable aux jars
{code}
// reference HDFS based resources directly, to use the distributed cache efficiently
addHdfsResource(conf, tmpResources, LocalResourceType.FILE,
    getHdfsTempFilesFromConf(conf));
// local resources are session based
tmpResources.addAll(
    addTempResources(conf, hdfsDirPathStr, LocalResourceType.FILE,
        getLocalTempFilesFromConf(conf), null).values());
{code}
2. hive.resource.use.hdfs.location=false
a) original behavior: collect all resources in HS2's scope (added files, added
jars, aux jars, reloadable aux jars) and put them into a session-based directory
{code}
// all resources, including HDFS ones, are session based
tmpResources.addAll(
    addTempResources(conf, hdfsDirPathStr, LocalResourceType.FILE,
        getTempFilesFromConf(conf), null).values());
{code}
My proposal is related to case 1).
Suppose a user is about to load an aux jar from HDFS and has it set in
hive.aux.jars.path:
{code}
hive.aux.jars.path=file:///opt/some_local_jar.jar,hdfs:///tmp/some_distributed.jar
{code}
In this case we can distinguish between file:// scheme resources and hdfs://
scheme resources:
- file:// scheme resources should fall into 1b) and still be served from the session dir
- hdfs:// scheme resources should fall into 1a) and simply be handled by addHdfsResource
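The proposed split could be sketched roughly as follows. This is a minimal standalone illustration, not actual Hive code: the class and helper names (SchemeSplit, splitByScheme) are hypothetical, and it only shows how the configured paths could be partitioned by URI scheme before being routed to the 1a) or 1b) code paths.

{code}
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SchemeSplit {
  // Partition a comma-separated hive.aux.jars.path value by URI scheme:
  // hdfs:// resources can be localized directly (case 1a), everything
  // else stays on the session-based path (case 1b).
  static List<List<String>> splitByScheme(String auxJarsPath) {
    List<String> hdfsResources = new ArrayList<>();
    List<String> sessionResources = new ArrayList<>();
    for (String path : auxJarsPath.split(",")) {
      String trimmed = path.trim();
      if ("hdfs".equals(URI.create(trimmed).getScheme())) {
        hdfsResources.add(trimmed);   // -> addHdfsResource(...)
      } else {
        sessionResources.add(trimmed); // -> addTempResources(...)
      }
    }
    return Arrays.asList(hdfsResources, sessionResources);
  }

  public static void main(String[] args) {
    List<List<String>> split = splitByScheme(
        "file:///opt/some_local_jar.jar,hdfs:///tmp/some_distributed.jar");
    System.out.println("hdfs:    " + split.get(0));
    System.out.println("session: " + split.get(1));
  }
}
{code}

With the example value above, the hdfs list would contain only hdfs:///tmp/some_distributed.jar and the session list only file:///opt/some_local_jar.jar.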
> Consider extending HIVE-17574 to aux jars
> -----------------------------------------
>
> Key: HIVE-27737
> URL: https://issues.apache.org/jira/browse/HIVE-27737
> Project: Hive
> Issue Type: Improvement
> Reporter: László Bodor
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)