andreitaleanu opened a new pull request #1784:
URL: https://github.com/apache/hudi/pull/1784


   ## What is the purpose of this pull request
   
   This pull request fixes 
[HUDI-539](https://issues.apache.org/jira/browse/HUDI-539).
   
   ## Brief change log
   - Make `HoodieROTablePathFilter` implement `Configurable`
   
   To see why the path filter needs to implement `Configurable`, follow the code through these links:
   1. [This](https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L125) is how Spark creates the filter.
   2. The filter is instantiated via [reflection](https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L159).
   3. That instantiation calls the [setConf](https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L133) method declared by the [Configurable](https://github.com/apache/hadoop-common/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L72-L74) type.
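   The hand-off in steps 2 and 3 can be illustrated with a small, self-contained sketch. The `Configuration`, `Configurable`, and `PathFilter` traits below are simplified stand-ins for the Hadoop types (illustration only, not the real API), and `MyFilter` plays the role that the patched `HoodieROTablePathFilter` plays after this change:

   ```scala
   // Simplified stand-ins for org.apache.hadoop.conf.Configuration/Configurable
   // and org.apache.hadoop.fs.PathFilter -- illustration only, not the real API.
   trait Configuration
   trait Configurable {
     def setConf(conf: Configuration): Unit
     def getConf: Configuration
   }
   trait PathFilter { def accept(path: String): Boolean }

   // What ReflectionUtils.setConf effectively does after the filter has been
   // instantiated reflectively: if the instance implements Configurable,
   // inject the Hadoop configuration into it; otherwise do nothing.
   def setConfIfConfigurable(obj: AnyRef, conf: Configuration): Unit = obj match {
     case c: Configurable => c.setConf(conf)
     case _               => // plain filters receive no configuration
   }

   // A filter that, like the patched HoodieROTablePathFilter, mixes in
   // Configurable and therefore receives the configuration from step 3.
   class MyFilter extends PathFilter with Configurable {
     private var conf: Configuration = _
     override def setConf(c: Configuration): Unit = { conf = c }
     override def getConf: Configuration = conf
     // accept can now rely on the configuration being present
     override def accept(path: String): Boolean =
       conf != null && !path.startsWith(".")
   }
   ```

   Without the `Configurable` mix-in, the `case _` branch is taken and the filter never sees the Hadoop configuration; with it, the reflective instantiation in step 2 followed by the injection in step 3 leaves the filter with a usable configuration.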
   
   ## Verify this pull request
   
   This pull request is already covered by existing unit tests, such as
   `TestHoodieROTablePathFilter`. Apart from that, I've tested this change by
   writing data to Azure Data Lake and then reading it back in the following
   environment:

   - Databricks runtime: 6.6
   - Hudi version: 0.5.3
   - Spark version: 2.4.4
   
   ```scala
   // Imports assumed for this snippet (Hudi 0.5.3 / Spark 2.4.4):
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs
   import org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME
   import org.apache.spark.sql.SaveMode.Overwrite
   import spark.implicits._

   val stringData =
     """
       |id,ts,day
       | 0,0,2020-01-01
       | 1,1,2020-01-01
       | 2,1,2020-01-01
       | 3,2,2020-01-02
       | 4,2,2020-01-02
       |""".stripMargin

   // Parse the in-memory CSV into a DataFrame, using the first row as header
   val data = spark.read
     .option("header", value = true)
     .csv(sc.parallelize(stringData.lines.toSeq).toDS())

   // Write the data as a Hudi table partitioned by day
   data.write
     .format("org.apache.hudi")
     .options(getQuickstartWriteConfigs)
     .option(PRECOMBINE_FIELD_OPT_KEY, "ts")
     .option(RECORDKEY_FIELD_OPT_KEY, "id")
     .option(PARTITIONPATH_FIELD_OPT_KEY, "day")
     .option(TABLE_NAME, "mytable")
     .mode(Overwrite)
     .save("adl://mystore.azuredatalakestore.net/hudi")

   // Read the table back through the Hudi datasource; this exercises
   // HoodieROTablePathFilter on the read path
   spark.read
     .format("org.apache.hudi")
     .load("adl://mystore.azuredatalakestore.net/hudi/*")
     .show()
   ```
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.

