JoshuaZhuCN opened a new issue #3981:
URL: https://github.com/apache/hudi/issues/3981


   When the HBase index is used for a Hudi table, after initializing the data with insert or upsert, only log files are generated in the table directory; there is no Parquet file.

   At this point, reading the table with the Spark datasource in snapshot mode returns no data. The data can only be obtained through an incremental read.

   According to the official documentation, a snapshot query on a Hudi table merges the Parquet and log files. Why, then, can no data be read when there are only log files?

   Is this a bug?
   
   
   **Steps to reproduce the behavior:**
   
   1. Create hudi table with hbase index
   2. Use insert or upsert to initialize data
   3. Check whether there are only log files in the Hudi table directory
   4. Read data using snapshot mode and incremental mode respectively
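   The code sample below only covers the read side (step 4). For completeness, here is a minimal sketch of what steps 1–2 might look like; the option keys are standard Hudi 0.9 configs, but `inputDf`, the record key and precombine fields, the ZooKeeper address, and the HBase table name are placeholders, not values from this report:

   ```scala
   // Hypothetical write-side sketch: creates a MERGE_ON_READ table with the
   // HBase index and initializes it with an upsert. All concrete values below
   // (field names, ZK quorum, HBase table) are placeholders.
   inputDf.write
       .format("hudi")
       .option("hoodie.table.name", "oms_order_info_hbase")
       .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
       .option("hoodie.datasource.write.operation", "upsert")
       .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
       .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
       .option("hoodie.index.type", "HBASE")
       .option("hoodie.index.hbase.zkquorum", "localhost")        // placeholder
       .option("hoodie.index.hbase.zkport", "2181")               // placeholder
       .option("hoodie.index.hbase.table", "hudi_index")          // placeholder
       .mode("append")
       .save("C:\\hudi_data\\baikal\\oms_order_info_hbase")
   ```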
   
   
   **Environment Description**
   
   * Hudi version : 0.9.0
   * Spark version : 2.4.7
   * Hive version : ~
   * Hadoop version : 3.1.1
   * Storage (HDFS/S3/GCS..) : HDFS
   * Running on Docker? (yes/no) : no
   
   **Code to reproduce**
   
   ```scala
   import org.apache.spark.SparkConf
   import org.apache.spark.sql.SparkSession
   import org.apache.hudi.DataSourceReadOptions

   val conf = new SparkConf()
       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   conf.setMaster("local[2]")

   val spark = SparkSession
       .builder()
       .config(conf)
       .getOrCreate()

   println("=================Snapshot Read===============")
   val dfSnapshot = spark
       .read
       .format("hudi")
       .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
       .load("C:\\hudi_data\\baikal\\oms_order_info_hbase\\default\\*")
   dfSnapshot.show(1, false)
   println("================================================")
   println("")
   println("=================Incremental Read===============")
   val dfIncremental = spark
       .read
       .format("hudi")
       .option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
       .option(DataSourceReadOptions.BEGIN_INSTANTTIME.key(), "19800101000000")
       .load("C:\\hudi_data\\baikal\\oms_order_info_hbase\\default\\*")
   dfIncremental.show(1, false)
   println("================================================")
   spark.close()
   ```
   
   

