jenu9417 opened a new issue #1528: [SUPPORT] Issue while writing to HDFS via 
hudi. Only `/.hoodie` folder is written.
URL: https://github.com/apache/incubator-hudi/issues/1528
 
 
   Hi all,
   We are doing a POC to sync our data in micro-batches from Kafka to HDFS. We currently use the plain Kafka consumer APIs, convert the records to a Dataset, and then write it to HDFS via Hudi. We are facing some problems with this.
   
   ```
   // `items` is a List<String> containing data from Kafka
   final Dataset<Record> df = spark.createDataset(items, Encoders.STRING())
           .toDF()
           .map(new Mapper(), Encoders.bean(Record.class))
           .filter(new Column("name").equalTo("aaa"));

   df.write().format("hudi")
           .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "id")
           .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "batch")
           .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
           .option(HoodieWriteConfig.TABLE_NAME, table)
           .mode(SaveMode.Append)
           .save(output);
           //.parquet(output);
   ```
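   For debugging, the same write options could also be collected into a plain map and printed before the write. This is only a sketch: `HudiWriteOptions` is a hypothetical helper, and the string keys are assumed to be the values behind the `DataSourceWriteOptions` and `HoodieWriteConfig` constants used above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi. The keys below are assumed to match
// the constants used in the snippet above (Hudi 0.5.2-incubating).
class HudiWriteOptions {

    /** Builds the write options for the given table name, in insertion order. */
    static Map<String, String> build(String table) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.write.recordkey.field", "id");
        opts.put("hoodie.datasource.write.partitionpath.field", "batch");
        opts.put("hoodie.datasource.write.precombine.field", "timestamp");
        opts.put("hoodie.table.name", table);
        return opts;
    }

    public static void main(String[] args) {
        // Print the effective options so they can be checked before writing.
        build("table1").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

   The map could then be passed in one call, e.g. `df.write().format("hudi").options(HudiWriteOptions.build(table)).mode(SaveMode.Append).save(output);`.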
   
   a) When using `save` to write the Dataset, only the `/.hoodie` folder exists after the write. No actual data is present. We do not see any errors in the logs. The following lines are repeated continuously during the write phase:
   ```
   91973 [Executor task launch worker for task 11129] INFO  org.apache.spark.storage.BlockManager  - Found block rdd_36_624 locally
   91974 [Executor task launch worker for task 11128] INFO  org.apache.spark.storage.BlockManager  - Found block rdd_36_623 locally
   91974 [Executor task launch worker for task 11129] INFO  org.apache.spark.executor.Executor  - Finished task 624.0 in stage 19.0 (TID 11129). 699 bytes result sent to driver
   91975 [dispatcher-event-loop-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Starting task 625.0 in stage 19.0 (TID 11130, localhost, executor driver, partition 625, PROCESS_LOCAL, 7193 bytes)
   91975 [Executor task launch worker for task 11130] INFO  org.apache.spark.executor.Executor  - Running task 625.0 in stage 19.0 (TID 11130)
   91975 [task-result-getter-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Finished task 624.0 in stage 19.0 (TID 11129) in 16 ms on localhost (executor driver) (624/1500)
   91985 [Executor task launch worker for task 11128] INFO  org.apache.spark.executor.Executor  - Finished task 623.0 in stage 19.0 (TID 11128). 871 bytes result sent to driver
   91985 [dispatcher-event-loop-0] INFO  org.apache.spark.scheduler.TaskSetManager  - Starting task 626.0 in stage 19.0 (TID 11131, localhost, executor driver, partition 626, PROCESS_LOCAL, 7193 bytes)
   91985 [task-result-getter-1] INFO  org.apache.spark.scheduler.TaskSetManager  - Finished task 623.0 in stage 19.0 (TID 11128) in 27 ms on localhost (executor driver) (625/1500)
   91986 [Executor task launch worker for task 11131] INFO  org.apache.spark.executor.Executor  - Running task 626.0 in stage 19.0 (TID 11131)
   ```
   We have verified that fetching data from Kafka and creating the Dataset both work. The only issue seems to be with the write.
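   To check whether the stage is actually advancing, the `(done/total)` counter at the end of the `TaskSetManager` "Finished task" lines can be extracted. The following is a minimal sketch that assumes the log format shown above; `StageProgress` is a hypothetical helper, not part of Spark or Hudi.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark or Hudi.
class StageProgress {
    // Matches the trailing "(done/total)" counter on "Finished task" lines.
    private static final Pattern COUNTER =
        Pattern.compile("Finished task .*\\((\\d+)/(\\d+)\\)\\s*$");

    /** Returns {done, total} if the line ends with a progress counter. */
    static Optional<int[]> parse(String line) {
        Matcher m = COUNTER.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new int[] {
            Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2))
        });
    }

    public static void main(String[] args) {
        String line = "91975 [task-result-getter-0] INFO  "
            + "org.apache.spark.scheduler.TaskSetManager  - Finished task 624.0 "
            + "in stage 19.0 (TID 11129) in 16 ms on localhost "
            + "(executor driver) (624/1500)";
        // Prints "624 of 1500 tasks finished"
        parse(line).ifPresent(p ->
            System.out.println(p[0] + " of " + p[1] + " tasks finished"));
    }
}
```

   Lines from `Executor` or `BlockManager` carry no such counter, so `parse` returns an empty result for them.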
   
   b) When using `parquet` to write the Dataset, the data is written to the output directory in Parquet format, but without any partition folders. Is this expected? What is the difference between `save` and `parquet`? Also, when querying this Parquet data from the Spark shell via Spark SQL, I was not able to find any Hudi meta fields. For example:
   ```
   spark.sql("select id, name, `_hoodie_commit_time` from table1 limit 5").show();
   ```
   The query fails with an error saying that there is no such field as `_hoodie_commit_time`.
   
   c) Where can I find the metadata about the data currently present in Hudi tables, i.e. what are the new commits, when was the last commit, etc.? From the documentation it seemed that this data is managed by Hudi.
   
   d) How is data compaction managed by Hudi? Are there any background jobs running?
   
   Sorry if these are naive questions, but we are completely new to this. It would also be helpful if someone could point us to more detailed documentation on these topics.
   
   Thanks.
   
   
   **Steps to reproduce the behavior:**
   
   1. Code snippet used for write has been shared.
   
   
   **Expected behavior**
   
   Currently, the write produces only the `/.hoodie` folder, without any data. The expected behaviour is that the data is written as well.
   
   
   **Environment Description**
   
   * Hudi version :  0.5.2-incubating
   
   * Spark version :  2.4.0
   
   * Hive version : - 
   
   * Hadoop version : 2.9.2
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
