hudi-bot opened a new issue, #14537: URL: https://github.com/apache/hudi/issues/14537
Currently, we are able to query MOR tables that have base parquet files with inserts and log files with updates. However, we are unable to query tables that have insert-only log files: both the _ro and _rt tables return 0 rows, even though HMS does create the table and its partitions. One sample table is here: [https://s3.console.aws.amazon.com/s3/buckets/debug-hive-site?prefix=database/&region=us-east-2]

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-2762
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-2749

---

## Comments

25/Nov/21 03:44;rmahindra;In a test of Kafka Connect with the latest master, this issue seems to have re-emerged.;;;

---

11/Jan/22 21:29;alexey.kudinkin;[~rmahindra] is my understanding correct that the issue is about Hive not being able to read tables where there's *no* base file and the inserts are in the log files?;;;

---

11/Jan/22 21:29;alexey.kudinkin;This should be addressed by HUDI-3082;;;

---

21/Jan/22 01:57;rmahindra;[~alexey.kudinkin] sorry, I missed your question. Yes, your understanding is correct.;;;

---

31/Jan/22 22:48;alexey.kudinkin;To reproduce: just follow the Kafka Connect guide;;;

---

18/Feb/22 09:59;mengtao;[~alexey.kudinkin] This is Hive's problem. Hive filters out every file whose name starts with "."; see org.apache.hadoop.hive.common.FileUtils in Hive.;;;

---

23/Feb/22 03:33;xushiyan;[~rex_xiong] would you be interested in taking this up?;;;

---

24/Mar/22 19:20;xushiyan;[~alexey.kudinkin] [~mengtao] [~rex_xiong] do we have a solution here? As [~mengtao] mentioned, it's Hive's problem. Any suggestions to move this forward?;;;

---

25/Mar/22 16:40;alexey.kudinkin;We need to validate, as this should be fixed after the recent changes to how we do file listing for Hive as well.;;;

---

07/Apr/22 08:33;rex_xiong;As [~mengtao] mentioned, Hive filters out files whose names start with "." or "_". I don't think it's appropriate to simply change this on the Hive side, because Hive treats these files as temporary files and many users may rely on this "feature" in their own production scenarios.;;;

---

08/Apr/22 10:26;codope;[~rex_xiong] [~alexey.kudinkin] [~mengtao] [~rmahindra] This issue should still be reproducible. As [~rex_xiong] mentioned, the default path filters in Hive filter out such files. We do have our own custom path filter (HoodieROTablePathFilter), but it only filters the base files. For insert-logs-only writes, we may have to write the index type to the table config and then accept log files in the RO view based on that config. This is a significant change. I don't see many people writing inserts only to log files; the primary use case is the kafka-connect sync. I think in that case we can write a custom path filter for kafka-connect, because there we can safely assume that all files are log files.;;;
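
For readers following the thread, here is a minimal Python sketch of the filtering behaviour discussed in the comments above. It is a toy model, not Hive's or Hudi's actual code: the function names are hypothetical, the file names are made up for illustration, and the assumption that a log file can be recognised by a leading "." plus a ".log." segment is only an approximation of Hudi's naming scheme.

```python
from typing import Callable, Iterable, List


def default_hidden_file_filter(name: str) -> bool:
    """Model of the rule described in the thread: names starting
    with '.' or '_' are treated as hidden/temporary and skipped."""
    return not (name.startswith(".") or name.startswith("_"))


def log_aware_filter(name: str) -> bool:
    """Hypothetical custom filter: keep everything the default rule
    keeps, and additionally keep names that look like log files
    (assumed here: leading '.' and a '.log.' segment)."""
    looks_like_log = name.startswith(".") and ".log." in name
    return default_hidden_file_filter(name) or looks_like_log


def list_visible(names: Iterable[str], path_filter: Callable[[str], bool]) -> List[str]:
    """Return the file names that survive the given path filter."""
    return [n for n in names if path_filter(n)]


# Illustrative partition contents for an insert-only-log table:
# there is no base parquet file, only a metadata file and a log file.
partition_files = [
    ".hoodie_partition_metadata",
    ".fileid-0001_20211125034400.log.1_0-1-0",
]

# With the default rule nothing is visible, which matches the
# "0 rows" symptom reported in this issue.
visible_default = list_visible(partition_files, default_hidden_file_filter)

# With the hypothetical log-aware filter the log file is kept.
visible_custom = list_visible(partition_files, log_aware_filter)
```

In this model the metadata file stays hidden under both filters, while the log file becomes visible only under the custom one. That is roughly the idea behind the kafka-connect-specific path filter suggested in the last comment.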
