[ 
https://issues.apache.org/jira/browse/HIVE-27737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-27737:
--------------------------------
    Description: 
[HIVE-17574|https://github.com/apache/hive/commit/26753ade2a130339940119c950b9c9af53e3d024] introduced an optimization where HDFS-based resources can optionally be localized directly from their "original" HDFS folder instead of from a Tez session dir. This reduced HDFS overhead by introducing hive.resource.use.hdfs.location, so there are 2 cases:

1. hive.resource.use.hdfs.location=true
a) collect "HDFS temp files" and optimize their access: added files, added jars
b) collect local temp files and use the non-optimized, session-based approach: added files, added jars, aux jars, reloadable aux jars

{code}
      // reference HDFS based resources directly, to use the distributed cache efficiently.
      addHdfsResource(conf, tmpResources, LocalResourceType.FILE, getHdfsTempFilesFromConf(conf));
      // local resources are session based.
      tmpResources.addAll(
          addTempResources(conf, hdfsDirPathStr, LocalResourceType.FILE,
              getLocalTempFilesFromConf(conf), null).values()
      );
{code}

2. hive.resource.use.hdfs.location=false
a) original behavior: collect all resources in HS2's scope (added files, added jars, aux jars, reloadable aux jars) and put them into a session-based directory
{code}
      // all resources including HDFS are session based.
      tmpResources.addAll(
          addTempResources(conf, hdfsDirPathStr, LocalResourceType.FILE,
              getTempFilesFromConf(conf), null).values()
      );
{code}

My proposal is related to case 1).
Let's say a user is about to load an aux jar from HDFS and has it set in hive.aux.jars.path:
{code}
hive.aux.jars.path=file:///opt/some_local_jar.jar,hdfs:///tmp/some_distributed.jar
{code}

In this case we can distinguish between file:// scheme resources and hdfs:// scheme resources:
- file scheme resources should fall into 1b) and still be used from the session dir
- hdfs scheme resources should fall into 1a) and simply be handled by addHdfsResource
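The scheme split above could be sketched roughly like this (a minimal, self-contained illustration, not actual Hive code; the class and helper names are hypothetical):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class AuxJarSchemeSplit {
    // Hypothetical helper: partition hive.aux.jars.path entries by URI scheme,
    // so hdfs:// resources can take the optimized path (case 1a) and
    // everything else the session-based path (case 1b).
    static List<List<String>> splitByScheme(String auxJarsPath) {
        List<String> hdfsJars = new ArrayList<>();
        List<String> localJars = new ArrayList<>();
        for (String entry : auxJarsPath.split(",")) {
            String scheme = URI.create(entry.trim()).getScheme();
            if ("hdfs".equalsIgnoreCase(scheme)) {
                hdfsJars.add(entry.trim());   // would go to addHdfsResource (1a)
            } else {
                localJars.add(entry.trim());  // would go to addTempResources (1b)
            }
        }
        return List.of(hdfsJars, localJars);
    }

    public static void main(String[] args) {
        String auxJars = "file:///opt/some_local_jar.jar,hdfs:///tmp/some_distributed.jar";
        List<List<String>> split = splitByScheme(auxJars);
        System.out.println("hdfs:  " + split.get(0));
        System.out.println("local: " + split.get(1));
    }
}
```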

 
This needs a bit of attention at every usage of aux jars, because aux jars are, for example, supposed to be classloaded into HS2 sessions, so in the case of an HDFS resource, that has to be taken care of as well.
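The classloading caveat can be demonstrated with plain Java: a stock JVM has URL protocol handlers for file:, http:, jar: etc., but not for hdfs: (unless something like Hadoop's FsUrlStreamHandlerFactory is registered), so an HDFS aux jar would have to be localized before it can be put on a URLClassLoader. A minimal illustration (not Hive code):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HdfsUrlCheck {
    // Constructing an hdfs:// URL in a plain JVM throws MalformedURLException
    // ("unknown protocol: hdfs"), which is why an HDFS-scheme aux jar cannot
    // be handed directly to a standard URLClassLoader and must be localized
    // (copied to the local filesystem) first.
    static boolean loadableByUrlClassLoader(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(loadableByUrlClassLoader("file:///opt/some_local_jar.jar"));
        System.out.println(loadableByUrlClassLoader("hdfs:///tmp/some_distributed.jar"));
    }
}
```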


> Consider extending HIVE-17574 to aux jars
> -----------------------------------------
>
>                 Key: HIVE-27737
>                 URL: https://issues.apache.org/jira/browse/HIVE-27737
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
