[
https://issues.apache.org/jira/browse/FLINK-36594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
slankka updated FLINK-36594:
----------------------------
Description:
h3. *Background:*
Recently, I have been using HiveCatalog together with Hudi's sync to HMS.
HiveCatalog can cause subsequent Hive configuration retrieval to fail: in my
case, Hudi could not pick up the hive-site configuration provided on the classpath.
h3. *TL;DR:*
Once HiveCatalog is initialized, it disables the implicit lookup by setting
*HiveConf.hiveSiteLocation* to null. From then on, any new HiveConf instance
will never load hive-site.xml, no matter where the user puts it on the
classpath (for example, via YARN provided libraries).
h3. {*}Addressing{*}:
HiveCatalog can load hive-site.xml by itself without relying on this variable;
however, code that runs afterwards still assumes that HiveConf 'searches' for
hive-site.xml on the classpath.
Related change: https://issues.apache.org/jira/browse/FLINK-22092
h3. *Cause:*
The configuration is only picked up again if you call addResource explicitly,
set hiveSiteLocation back, or make Hive find the file in the user's uber jar,
which requires extra effort.
My point is that {+}big data developers are confused about where to provide
core-site.xml, hive-site.xml, hbase-site.xml and so on{+}. On the other side,
big data framework developers search for these files here and there and cannot
be sure they picked the right one.
As a consequence, users and cloud providers put their xxx-site.xml files everywhere:
# /etc/hive/conf, /etc/hadoop/conf
# FLINK_HOME/lib, SPARK_HOME/conf
# yarn.provided.lib.dir (resource prefix ./lib, ./plugin/)
# packed into their uber jar
# --files of Apache Spark, --yarnship hive-site.xml (works)
Due to the difference between yarn-per-job and yarn-application deployment, the
main() of the application can run in different places.
The simplest way to provide xxx-site.xml is to place it on both the client-side
classpath and the container classpath (root path). As a cloud infrastructure
provider, I would put it in 1, 2 and 3; as a Flink user who does not trust the
platform, I would pack it into my jar and ask the cloud provider to give me the
xxx-site.xml.
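A minimal sketch of how a framework could resolve hive-site.xml across these locations, assuming a classpath-root copy is preferred and the HIVE_CONF_DIR environment variable is used as a fallback (the helper name and fallback order are illustrative, not taken from any existing project):
{code:java}
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative helper: prefer hive-site.xml at the root of the classpath,
// then fall back to $HIVE_CONF_DIR/hive-site.xml if it exists.
static URL resolveHiveSite() throws MalformedURLException {
    URL fromClasspath =
            Thread.currentThread().getContextClassLoader().getResource("hive-site.xml");
    if (fromClasspath != null) {
        return fromClasspath;
    }
    String confDir = System.getenv("HIVE_CONF_DIR");
    if (confDir != null) {
        File candidate = new File(confDir, "hive-site.xml");
        if (candidate.exists()) {
            return candidate.toURI().toURL();
        }
    }
    return null; // caller decides whether a missing file is an error
}{code}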
In addition, the two classes below both use their private method
*findConfigFile* to search for *hiveSiteLocation* on the classpath:
* org.apache.hadoop.hive.conf.HiveConf
* org.apache.hadoop.hive.metastore.conf.MetastoreConf
h3. {*}Conclusion{*}:
# HiveConf calls findConfigFile and caches hiveSiteLocation only once, during
class initialization.
# MetastoreConf searches for hiveSiteLocation again even if somebody has set it
to null. (This behavior is better.)
# Both HiveConf and MetastoreConf only recognize hive-site.xml at the first
level of the classpath, e.g. "lib/hive-site.xml" is not found (see the lookup
sketch after the excerpts below).
{code:java}
// Excerpt from org.apache.hadoop.hive.metastore.conf.MetastoreConf
private MetastoreConf() {
  throw new RuntimeException("You should never be creating one of these!");
}

public static Configuration newMetastoreConf() {
  ...
  // searches for hive-site.xml again whenever the cached URL is null
  if (hiveSiteURL == null) {
    hiveSiteURL = findConfigFile(classLoader, "hive-site.xml");
  }
  ...
}{code}
{code:java}
// Excerpt from org.apache.hadoop.hive.conf.HiveConf
// The static initializer searches for hive-site.xml only once, when the class is loaded.
static {
  hiveSiteURL = findConfigFile(classLoader, "hive-site.xml", true);
}
...
private void initialize(Class<?> cls) {
  ...
  // if hiveSiteLocation has been reset to null, hive-site.xml is silently skipped
  if (hiveSiteURL != null) {
    addResource(hiveSiteURL);
  }
  ...
}{code}
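To illustrate point 3 above, a findConfigFile-style lookup through the ClassLoader only sees resources at the root of a classpath entry; here is a minimal, self-contained sketch (the file locations in the comments are hypothetical):
{code:java}
import java.net.URL;

public class HiveSiteLookupDemo {
    public static void main(String[] args) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        // Resolves only when hive-site.xml sits at the root of a classpath entry,
        // e.g. /opt/conf/hive-site.xml with /opt/conf on the classpath.
        URL atRoot = cl.getResource("hive-site.xml");
        // A copy stored at /opt/conf/lib/hive-site.xml is NOT visible under the
        // plain name; it is only reachable as the nested resource "lib/hive-site.xml".
        URL nested = cl.getResource("lib/hive-site.xml");
        System.out.println("root copy: " + atRoot + ", nested copy: " + nested);
    }
}{code}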
For example, register a HiveCatalog as in the Flink documentation:
{code:java}
String name = "myhive";
String defaultDatabase = "mydatabase";
String hiveConfDir = "/opt/hive-conf";
HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
tableEnv.registerCatalog("myhive", hive);
// set the HiveCatalog as the current catalog of the session
tableEnv.useCatalog("myhive"); {code}
After running the code above:
{code:java}
// Another framework that uses Hive natively:
HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
// or directly
HiveConf hiveConf = new HiveConf(); {code}
This hiveConf *DOES NOT* load hive-site.xml from the classpath, which causes
the configuration loading to fail.
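One way to observe the symptom (a hedged sketch, assuming HiveConf's static getHiveSiteLocation() accessor; hive.metastore.uris is just an example of a property normally supplied by hive-site.xml):
{code:java}
// Run after the HiveCatalog above has been registered.
HiveConf hiveConf = new HiveConf();

// The cached location was reset by the catalog, so nothing from hive-site.xml is applied.
System.out.println("cached hive-site location: " + HiveConf.getHiveSiteLocation()); // null
System.out.println("hive.metastore.uris: " + hiveConf.get("hive.metastore.uris")); // default, not the hive-site.xml value
{code}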
Example code from HiveSyncConfig of Apache Hudi:
{code:java}
public HiveSyncConfig(Properties props, Configuration hadoopConf) {
  super(props, hadoopConf);
  HiveConf hiveConf = new HiveConf();
  // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
  hiveConf.addResource(hadoopConf);
  setHadoopConf(hiveConf);
  validateParameters();
} {code}
A temporary workaround for this issue is to search for the file again :)
{code:java}
// Restore the cached hive-site.xml location before constructing the HiveConf.
HiveConf.setHiveSiteLocation(classLoader.getResource(HiveCatalog.HIVE_SITE_FILE));
HiveConf hiveConf = new HiveConf();{code}
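What the issue title proposes could look roughly like the following sketch (illustrative only, not the actual Flink code; the method name createHiveConf and the hiveSiteUrl parameter are made up for the example):
{code:java}
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;

// Sketch of the proposed behavior: remember the previously cached hive-site.xml
// location and set it back after the catalog has built its own HiveConf, so a
// later `new HiveConf()` in another framework still sees hive-site.xml.
static HiveConf createHiveConf(Configuration hadoopConf, URL hiveSiteUrl) {
    URL previous = HiveConf.getHiveSiteLocation();
    try {
        HiveConf.setHiveSiteLocation(null);      // current behavior: suppress the implicit load
        HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
        hiveConf.addResource(hiveSiteUrl);       // hive-site.xml located by the catalog itself
        return hiveConf;
    } finally {
        HiveConf.setHiveSiteLocation(previous);  // proposed: set the cached location back
    }
}{code}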
> HiveCatalog should set HiveConf.hiveSiteLocation back
> -----------------------------------------------------
>
> Key: FLINK-36594
> URL: https://issues.apache.org/jira/browse/FLINK-36594
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Hive
> Affects Versions: 1.20.1
> Reporter: slankka
> Priority: Minor
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)