hudi-bot opened a new issue, #14537:
URL: https://github.com/apache/hudi/issues/14537

   Currently, we are able to query MOR tables that have base parquet files with 
inserts and log files with updates. However, we are unable to query 
tables with insert-only log files: both the _ro and _rt tables return 0 
rows, even though HMS does create the table and its partitions. 
   
    
   
   One sample table is here:
   
[https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/&region=us-east-2]
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-2762
   - Type: Improvement
   - Epic: https://issues.apache.org/jira/browse/HUDI-2749
   
   
   ---
   
   
   ## Comments
   
   25/Nov/21 03:44;rmahindra;In a test of kafka connect with the latest master, 
this issue seems to have re-emerged.;;;
   
   ---
   
   11/Jan/22 21:29;alexey.kudinkin;[~rmahindra] is my understanding correct 
that the issue is about Hive not being able to read tables where there's *no* 
Base file, and inserts are in the Log files? ;;;
   
   ---
   
   11/Jan/22 21:29;alexey.kudinkin;This should be addressed by HUDI-3082;;;
   
   ---
   
   21/Jan/22 01:57;rmahindra;[~alexey.kudinkin] sorry, I missed your question. 
Yes, your understanding is correct.;;;
   
   ---
   
   31/Jan/22 22:48;alexey.kudinkin;To reproduce, just follow the Kafka Connect 
guide.;;;
   
   ---
   
   18/Feb/22 09:59;mengtao;[~alexey.kudinkin] This is a Hive problem. 
Hive filters out all files whose names start with ".".
   
    See org.apache.hadoop.hive.common.FileUtils in Hive.;;;
   
   ---
   
   23/Feb/22 03:33;xushiyan;[~rex_xiong] would you be interested in taking this 
up?;;;
   
   ---
   
   24/Mar/22 19:20;xushiyan;[~alexey.kudinkin] [~mengtao] [~rex_xiong] do we 
have a solution here? As [~mengtao] mentioned, this is a Hive problem. Any 
suggestions for moving this forward?;;;
   
   ---
   
   25/Mar/22 16:40;alexey.kudinkin;We need to validate, as this should be fixed 
after recent changes to how we do file-listing for Hive as well.;;;
   
   ---
   
   07/Apr/22 08:33;rex_xiong;As [~mengtao] mentioned, Hive filters out 
files whose names start with "." or "_". I don't think it is appropriate to 
simply change this on the Hive side, because Hive treats these files as 
temporary files and many users may rely on this "feature" in their own 
production scenarios.;;;
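
The name-based rule described in this comment can be illustrated with a small sketch. This is not Hive's code: the function name and the sample file names are hypothetical, and only the rule itself (names starting with "." or "_" are skipped) is taken from the thread.

```python
def is_hidden_by_default(path: str) -> bool:
    """Mimic the rule described in this thread: a file is skipped
    when its name (last path component) starts with '.' or '_'."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith((".", "_"))


# Hypothetical listing of one partition; the names are illustrative only.
files = [
    "part=2021/abc-0_0-1-0_20211125034400.parquet",  # base file, no leading dot
    "part=2021/.abc-0_20211125034400.log.1_0-1-0",   # log file, leading dot
    "part=2021/_SUCCESS",                            # marker file, leading underscore
]

visible = [f for f in files if not is_hidden_by_default(f)]
```

With a table that has only log files, `visible` would be empty for every partition, which is consistent with both the _ro and _rt tables returning 0 rows.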
   
   ---
   
   08/Apr/22 10:26;codope;[~rex_xiong] [~alexey.kudinkin] [~mengtao] 
[~rmahindra] 
   
   This issue should still be reproducible. As [~rex_xiong] mentioned, default 
path filters in Hive will filter out such files.
   
   However, we do have our own custom path filter (HoodieROTablePathFilter) but 
that only filters the base files. For insert-logs-only writes, we may have to 
write the index type to table config and then accept log files in RO view based 
on that config. This is a significant change. 
   
   I don't see many people writing inserts only to log files. The primary use 
case is the Kafka Connect sync. I think in that case we can write a custom path 
filter for Kafka Connect, because there we can safely assume that all files are 
log files.;;;
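
A rough sketch of the kind of filter suggested above for log-only tables, written in Python for brevity rather than as an actual Hive path filter. The regular expression is a simplified assumption about log file naming (`.<fileId>_<instantTime>.log.<version>` with an optional `_<writeToken>` suffix), and `accept_for_log_only_table` is a hypothetical name, not a Hudi or Hive API.

```python
import re

# Assumed, simplified shape of a log file name:
#   .<fileId>_<instantTime>.log.<version>[_<writeToken>]
LOG_FILE_RE = re.compile(
    r"^\.(?P<file_id>.+)_(?P<instant>\d+)\.log\.(?P<version>\d+)"
    r"(_(?P<token>\d+-\d+-\d+))?$"
)


def accept_for_log_only_table(name: str) -> bool:
    """Hypothetical filter: accept names that look like log files,
    reject everything else (including other dot- or underscore-prefixed names)."""
    return LOG_FILE_RE.match(name) is not None


# Illustrative names only.
candidates = [
    ".abc-0_20211125034400.log.1_0-1-0",
    ".abc-0_20211125034400.log.2",
    ".hoodie_partition_metadata",
    "_SUCCESS",
]
accepted = [n for n in candidates if accept_for_log_only_table(n)]
```

The point of matching on the full name shape, rather than just the leading dot, is that other hidden files in the partition directory stay excluded.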


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
