andreitaleanu opened a new pull request #1784: URL: https://github.com/apache/hudi/pull/1784
## What is the purpose of this pull request

This pull request fixes [HUDI-539](https://issues.apache.org/jira/browse/HUDI-539).

## Brief change log

- Make `HoodieROTablePathFilter` implement `Configurable`

To follow the code and see why the path filter needs to implement `Configurable`, go through the following links:

1. [This](https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L125) is how Spark creates the filter.
2. The filter is instantiated via [reflection](https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L159).
3. That in turn calls the [setConf](https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L133) method defined by the [Configurable](https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L72-L74) type.

## Verify this pull request

This pull request is already covered by existing unit tests, such as `TestHoodieROTablePathFilter`.
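To illustrate the reflection chain above, here is a minimal, self-contained sketch of the pattern. Note that `Configuration`, `Configurable`, `MyPathFilter`, and `newInstance` below are simplified stand-ins written for this example; they only mimic the behavior of Hadoop's `org.apache.hadoop.conf` interfaces and `ReflectionUtils.newInstance`, and are not the real classes:

```scala
// Simplified stand-in for org.apache.hadoop.conf.Configuration (illustration only).
class Configuration(val settings: Map[String, String] = Map.empty)

// Simplified stand-in for org.apache.hadoop.conf.Configurable.
trait Configurable {
  def setConf(conf: Configuration): Unit
  def getConf: Configuration
}

// A path filter analogous to HoodieROTablePathFilter after this change:
// implementing Configurable lets the reflective factory inject the
// Configuration instead of the filter having to build its own.
class MyPathFilter extends Configurable {
  private var conf: Configuration = _
  override def setConf(conf: Configuration): Unit = this.conf = conf
  override def getConf: Configuration = conf
}

// Mimics what Hadoop's ReflectionUtils.newInstance does: instantiate via
// reflection, then call setConf only if the instance is Configurable.
def newInstance[T](clazz: Class[T], conf: Configuration): T = {
  val instance = clazz.getDeclaredConstructor().newInstance()
  instance match {
    case c: Configurable => c.setConf(conf) // configuration injected here
    case _               => // non-Configurable classes are left untouched
  }
  instance
}

val conf   = new Configuration(Map("fs.defaultFS" -> "adl://mystore.azuredatalakestore.net"))
val filter = newInstance(classOf[MyPathFilter], conf)
// filter.getConf now returns the injected configuration
```

Before this change the filter class did not implement `Configurable`, so the `match` above fell through and the filter never received the caller's `Configuration`.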
Apart from that, I've tested this change by writing data to Azure Data Lake and then reading it back in the following environment:

- Databricks runtime: 6.6
- Hudi version: 0.5.3
- Spark version: 2.4.4

```scala
val stringData =
  """
    |id,ts,day
    | 0,0,2020-01-01
    | 1,1,2020-01-01
    | 2,1,2020-01-01
    | 3,2,2020-01-02
    | 4,2,2020-01-02
    |""".stripMargin

val data = spark.read
  .option("header", value = true)
  .csv(sc.parallelize(stringData.lines.toSeq).toDS())

data.write
  .format("org.apache.hudi")
  .options(getQuickstartWriteConfigs)
  .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(PARTITIONPATH_FIELD_OPT_KEY, "day")
  .option(TABLE_NAME, "mytable")
  .mode(Overwrite)
  .save("adl://mystore.azuredatalakestore.net/hudi")

spark.read
  .format("org.apache.hudi")
  .load("adl://mystore.azuredatalakestore.net/hudi/*")
  .show()
```

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
