chb777777 opened a new issue #3301: URL: https://github.com/apache/hudi/issues/3301
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I want to perform an incremental load of around 10 terabytes of data by subfolder, using a wildcard (`*`) similar to how we can do it in Spark. It appears that the wildcard is not recognized, e.g.:

```
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://mybucket/root/folder1*
```

**To Reproduce**

Steps to reproduce the behavior:

1. Create an S3 bucket with some subfolders.
2. Add dummy parquet files to load in each subfolder.
3. Submit an EMR job with the Hudi config `--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://mylake-landing/FULL/CP_TEST_TABLE/202005*/*`.

**Expected behavior**

Files under paths matching the pattern are loaded, and other files are not, e.g. s3://mylake-landing/FULL/CP_TEST_TABLE/20200520, s3://mylake-landing/FULL/CP_TEST_TABLE/20200521, and s3://mylake-landing/FULL/CP_TEST_TABLE/20200522.

**Environment Description**

* Hudi version : 0.7
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : Amazon 2.10.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
* EMR version : 5.33.0

**Additional context**

How does one go about running an incremental load for a large historical backfill in Hudi? Are any glob or regex features available?
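For context on why the same pattern works in plain Spark: the DataFrame reader resolves glob patterns before listing files, whereas the stacktrace below suggests the configured root is handed to a `listStatus`-style call as a literal path. A minimal Python sketch of that difference, using the local filesystem as a stand-in for S3 (directory names follow the layout in the report; everything here is illustrative):

```python
import glob
import os
import tempfile

# Local stand-in for s3://mylake-landing/FULL/CP_TEST_TABLE/<yyyymmdd>/
root = tempfile.mkdtemp()
for day in ("20200520", "20200521", "20200610"):
    os.makedirs(os.path.join(root, day))

# Glob-aware lookup: roughly what spark.read.parquet(".../202005*") does.
matched = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(root, "202005*")))
print(matched)  # ['20200520', '20200521']

# Literal lookup: a listStatus-style call treats '*' as an ordinary character,
# so the path "202005*" simply does not exist.
try:
    os.listdir(os.path.join(root, "202005*"))
except FileNotFoundError:
    print("literal path containing '*' does not exist")
```

The `FileNotFoundException` in the stacktrace matches the second branch: the `*` reaches the filesystem unexpanded.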
**Stacktrace**

```
ApplicationMaster host: ip-10-67-29-149.ec2.internal
ApplicationMaster RPC port: 43459
queue: root.hadoop
start time: 1626707782404
final status: FAILED
tracking URL: http://ip-10-67-29-149.ec2.internal:20888/proxy/application_1626707666820_0001/
user: hadoop
21/07/19 15:16:47 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieIOException: Unable to read from source from checkpoint: Option{val=0}
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.getNextFilePathsAndMaxModificationTime(DFSPathSelector.java:147)
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.getNextFilePathsAndMaxModificationTime(DFSPathSelector.java:102)
	at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48)
	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
Caused by: java.io.FileNotFoundException: File s3://mylake-landing/FULL/CP_TEST_TABLE/202005* does not exist.
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:709)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1849)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1899)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1893)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:480)
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.listEligibleFiles(DFSPathSelector.java:156)
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.getNextFilePathsAndMaxModificationTime(DFSPathSelector.java:119)
	... 16 more
Exception in thread "main" org.apache.spark.SparkException: Application application_1626707666820_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1163)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/07/19 15:16:47 INFO ShutdownHookManager: Shutdown hook called
21/07/19 15:16:47 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-a63925d2-78ce-47f2-8178-f263625cec12
21/07/19 15:16:47 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-25cfd265-2583-4713-a05c-2c616ee2de6e
Command exiting with ret '1'
```
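Until the path selector understands globs, one possible workaround (a sketch, not a Hudi feature) is to expand the pattern yourself and drive one backfill run per resolved folder, passing each literal path as `hoodie.deltastreamer.source.dfs.root`. The folder list and helper below are hypothetical; in practice the listing would come from something like `aws s3 ls` on the parent prefix:

```python
import fnmatch

# Hypothetical listing of s3://mylake-landing/FULL/CP_TEST_TABLE/
# (yyyymmdd folder names, as in the report).
all_folders = ["20200420", "20200520", "20200521", "20200522", "20200601"]

def resolve_pattern(folders, pattern):
    """Client-side glob expansion that the DFS path selector does not perform."""
    return [f for f in folders if fnmatch.fnmatch(f, pattern)]

batches = resolve_pattern(all_folders, "202005*")
print(batches)  # ['20200520', '20200521', '20200522']

# Each match becomes one literal source root for a separate DeltaStreamer run:
roots = [f"s3://mylake-landing/FULL/CP_TEST_TABLE/{b}" for b in batches]
```

This trades one wildcard job for several literal-path jobs, which also keeps each checkpointed batch small for a large historical load.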
