[
https://issues.apache.org/jira/browse/FLINK-36594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
slankka updated FLINK-36594:
----------------------------
Description:
h3. *Background:*
Recently, I have been using HiveCatalog together with Hudi's sync to HMS.
HiveCatalog can cause subsequent Hive configuration retrieval to fail: in my
case, Hudi could not pick up the hive-site configuration provided on the classpath.
h3. *TL;DR:*
Once HiveCatalog is initialized, it disables the implicit lookup by setting
*HiveConf.hiveSiteLocation* to null. From then on, any new HiveConf instance
will never load hive-site.xml, no matter where the user puts it on the
classpath (for example, via YARN provided libraries).
h3. {*}Addressing{*}:
HiveCatalog can load hive-site.xml by itself without relying on this variable;
however, code that runs afterwards still assumes that HiveConf 'searches' for
hive-site.xml on the classpath.
Related change: https://issues.apache.org/jira/browse/FLINK-22092
h3. *Cause:*
The configuration is only picked up again if you call addResource explicitly,
set hiveSiteLocation back, or make Hive find the file in the user's uber jar,
which requires extra effort.
My point is that {+}big data developers are confused about where to provide
core-site.xml, hive-site.xml, hbase-site.xml and so on{+}. On the other side,
big data framework developers search for these files here and there and cannot
be sure they picked the right one.
As a consequence, users and cloud providers put their xxx-site.xml files everywhere:
# /etc/hive/conf, /etc/hadoop/conf
# FLINK_HOME/lib, SPARK_HOME/conf
# yarn.provided.lib.dir (resource prefix ./lib, ./plugin/)
# packed into their uber jar
# --files of Apache Spark, --yarnship hive-site.xml (works)
Due to the difference between yarn-per-job and yarn-application deployment, the
main() of the application can run in different places.
The simplest way to provide xxx-site.xml is to place it on both the client-side
classpath and the container classpath (root path). As a cloud infrastructure
provider, I would put it in 1, 2 and 3; as a Flink user who does not trust the
platform, I would pack it into my jar and ask the cloud provider to give me the
xxx-site.xml.
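A minimal sketch of how a framework could resolve hive-site.xml across these locations, assuming a classpath-root copy is preferred and the HIVE_CONF_DIR environment variable is used as a fallback (the helper name and fallback order are illustrative, not taken from any existing project):
{code:java}
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative helper: prefer hive-site.xml at the root of the classpath,
// then fall back to $HIVE_CONF_DIR/hive-site.xml if it exists.
static URL resolveHiveSite() throws MalformedURLException {
    URL fromClasspath =
            Thread.currentThread().getContextClassLoader().getResource("hive-site.xml");
    if (fromClasspath != null) {
        return fromClasspath;
    }
    String confDir = System.getenv("HIVE_CONF_DIR");
    if (confDir != null) {
        File candidate = new File(confDir, "hive-site.xml");
        if (candidate.exists()) {
            return candidate.toURI().toURL();
        }
    }
    return null; // caller decides whether a missing file is an error
}{code}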
In addition, the two classes below both use their private method
*findConfigFile* to search for *hiveSiteLocation* on the classpath:
* org.apache.hadoop.hive.conf.HiveConf
* org.apache.hadoop.hive.metastore.conf.MetastoreConf
h3. {*}Conclusion{*}:
# HiveConf calls findConfigFile and caches hiveSiteLocation only once, during
class initialization.
# MetastoreConf searches for hiveSiteLocation again even if somebody has set it
to null. (This behavior is better.)
# Both HiveConf and MetastoreConf only recognize hive-site.xml at the first
level of the classpath, e.g. "lib/hive-site.xml" is not found (see the lookup
sketch after the excerpts below).
{code:java}
// Excerpt from org.apache.hadoop.hive.metastore.conf.MetastoreConf
private MetastoreConf() {
  throw new RuntimeException("You should never be creating one of these!");
}

public static Configuration newMetastoreConf() {
  ...
  // searches for hive-site.xml again whenever the cached URL is null
  if (hiveSiteURL == null) {
    hiveSiteURL = findConfigFile(classLoader, "hive-site.xml");
  }
  ...
}{code}
{code:java}
// Excerpt from org.apache.hadoop.hive.conf.HiveConf
// The static initializer searches for hive-site.xml only once, when the class is loaded.
static {
  hiveSiteURL = findConfigFile(classLoader, "hive-site.xml", true);
}
...
private void initialize(Class<?> cls) {
  ...
  // if hiveSiteLocation has been reset to null, hive-site.xml is silently skipped
  if (hiveSiteURL != null) {
    addResource(hiveSiteURL);
  }
  ...
}{code}
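To illustrate point 3 above, a findConfigFile-style lookup through the ClassLoader only sees resources at the root of a classpath entry; here is a minimal, self-contained sketch (the file locations in the comments are hypothetical):
{code:java}
import java.net.URL;

public class HiveSiteLookupDemo {
    public static void main(String[] args) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        // Resolves only when hive-site.xml sits at the root of a classpath entry,
        // e.g. /opt/conf/hive-site.xml with /opt/conf on the classpath.
        URL atRoot = cl.getResource("hive-site.xml");
        // A copy stored at /opt/conf/lib/hive-site.xml is NOT visible under the
        // plain name; it is only reachable as the nested resource "lib/hive-site.xml".
        URL nested = cl.getResource("lib/hive-site.xml");
        System.out.println("root copy: " + atRoot + ", nested copy: " + nested);
    }
}{code}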
For example, register a HiveCatalog as in the Flink documentation:
{code:java}
String name = "myhive";
String defaultDatabase = "mydatabase";
String hiveConfDir = "/opt/hive-conf";
HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir);
tableEnv.registerCatalog("myhive", hive);
// set the HiveCatalog as the current catalog of the session
tableEnv.useCatalog("myhive"); {code}
After running the code above:
{code:java}
// Another framework that uses Hive natively:
HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
// or directly
HiveConf hiveConf = new HiveConf(); {code}
This hiveConf *DOES NOT* load hive-site.xml from the classpath, which causes
the configuration loading to fail.
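One way to observe the symptom (a hedged sketch, assuming HiveConf's static getHiveSiteLocation() accessor; hive.metastore.uris is just an example of a property normally supplied by hive-site.xml):
{code:java}
// Run after the HiveCatalog above has been registered.
HiveConf hiveConf = new HiveConf();

// The cached location was reset by the catalog, so nothing from hive-site.xml is applied.
System.out.println("cached hive-site location: " + HiveConf.getHiveSiteLocation()); // null
System.out.println("hive.metastore.uris: " + hiveConf.get("hive.metastore.uris")); // default, not the hive-site.xml value
{code}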
Example code from HiveSyncConfig of Apache Hudi:
{code:java}
public HiveSyncConfig(Properties props, Configuration hadoopConf) {
  super(props, hadoopConf);
  HiveConf hiveConf = new HiveConf();
  // HiveConf needs to load Hadoop conf to allow instantiation via AWSGlueClientFactory
  hiveConf.addResource(hadoopConf);
  setHadoopConf(hiveConf);
  validateParameters();
} {code}
A temporary workaround for this issue is to search for the file again :)
{code:java}
// Restore the cached hive-site.xml location before constructing the HiveConf.
HiveConf.setHiveSiteLocation(classLoader.getResource(HiveCatalog.HIVE_SITE_FILE));
HiveConf hiveConf = new HiveConf();{code}
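What the issue title proposes could look roughly like the following sketch (illustrative only, not the actual Flink code; the method name createHiveConf and the hiveSiteUrl parameter are made up for the example):
{code:java}
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.conf.HiveConf;

// Sketch of the proposed behavior: remember the previously cached hive-site.xml
// location and set it back after the catalog has built its own HiveConf, so a
// later `new HiveConf()` in another framework still sees hive-site.xml.
static HiveConf createHiveConf(Configuration hadoopConf, URL hiveSiteUrl) {
    URL previous = HiveConf.getHiveSiteLocation();
    try {
        HiveConf.setHiveSiteLocation(null);      // current behavior: suppress the implicit load
        HiveConf hiveConf = new HiveConf(hadoopConf, HiveConf.class);
        hiveConf.addResource(hiveSiteUrl);       // hive-site.xml located by the catalog itself
        return hiveConf;
    } finally {
        HiveConf.setHiveSiteLocation(previous);  // proposed: set the cached location back
    }
}{code}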
> HiveCatalog should set HiveConf.hiveSiteLocation back
> -----------------------------------------------------
>
> Key: FLINK-36594
> URL: https://issues.apache.org/jira/browse/FLINK-36594
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Hive
> Affects Versions: 1.20.1
> Reporter: slankka
> Priority: Minor
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)