chb777777 opened a new issue #3301: URL: https://github.com/apache/hudi/issues/3301
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I want to perform an incremental load of around 10 terabytes of data by subfolder, using a wildcard (`*`) similar to how we can do it in Spark. It appears that the wildcard is not recognized, e.g.:

```
--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://mybucket/root/folder1*
```

**To Reproduce**

Steps to reproduce the behavior:

1. Create an S3 bucket with some subfolders.
2. Add dummy parquet files to load in each subfolder.
3. Submit an EMR job with the Hudi config `--hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://mylake-landing/FULL/CP_TEST_TABLE/202005*/*`.

**Expected behavior**

Files under paths matching the pattern are loaded, and other files are not, e.g. s3://mylake-landing/FULL/CP_TEST_TABLE/20200520, s3://mylake-landing/FULL/CP_TEST_TABLE/20200521, and s3://mylake-landing/FULL/CP_TEST_TABLE/20200522.

**Environment Description**

* Hudi version : 0.7
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : Amazon 2.10.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
* EMR version : 5.33.0

**Additional context**

How does one go about running an incremental load for a large historical backfill in Hudi? Are any glob or regex features available?
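For context on why the same pattern works in plain Spark: the DataFrame reader resolves glob patterns before listing files, whereas the stacktrace below suggests the configured root is handed to a `listStatus`-style call as a literal path. A minimal Python sketch of that difference, using the local filesystem as a stand-in for S3 (directory names follow the layout in the report; everything here is illustrative):

```python
import glob
import os
import tempfile

# Local stand-in for s3://mylake-landing/FULL/CP_TEST_TABLE/<yyyymmdd>/
root = tempfile.mkdtemp()
for day in ("20200520", "20200521", "20200610"):
    os.makedirs(os.path.join(root, day))

# Glob-aware lookup: roughly what spark.read.parquet(".../202005*") does.
matched = sorted(os.path.basename(p)
                 for p in glob.glob(os.path.join(root, "202005*")))
print(matched)  # ['20200520', '20200521']

# Literal lookup: a listStatus-style call treats '*' as an ordinary character,
# so the path "202005*" simply does not exist.
try:
    os.listdir(os.path.join(root, "202005*"))
except FileNotFoundError:
    print("literal path containing '*' does not exist")
```

The `FileNotFoundException` in the stacktrace matches the second branch: the `*` reaches the filesystem unexpanded.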
**Stacktrace**

```
ApplicationMaster host: ip-10-67-29-149.ec2.internal
ApplicationMaster RPC port: 43459
queue: root.hadoop
start time: 1626707782404
final status: FAILED
tracking URL: http://ip-10-67-29-149.ec2.internal:20888/proxy/application_1626707666820_0001/
user: hadoop
21/07/19 15:16:47 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieIOException: Unable to read from source from checkpoint: Option{val=0}
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.getNextFilePathsAndMaxModificationTime(DFSPathSelector.java:147)
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.getNextFilePathsAndMaxModificationTime(DFSPathSelector.java:102)
	at org.apache.hudi.utilities.sources.ParquetDFSSource.fetchNextBatch(ParquetDFSSource.java:48)
	at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
	at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:75)
	at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:94)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:338)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:255)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
	at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:688)
Caused by: java.io.FileNotFoundException: File s3://mylake-landing/FULL/CP_TEST_TABLE/202005* does not exist.
	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.listStatus(S3NativeFileSystem.java:709)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1849)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1899)
	at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1893)
	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.listStatus(EmrFileSystem.java:480)
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.listEligibleFiles(DFSPathSelector.java:156)
	at org.apache.hudi.utilities.sources.helpers.DFSPathSelector.getNextFilePathsAndMaxModificationTime(DFSPathSelector.java:119)
	... 16 more
Exception in thread "main" org.apache.spark.SparkException: Application application_1626707666820_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1163)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1543)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
21/07/19 15:16:47 INFO ShutdownHookManager: Shutdown hook called
21/07/19 15:16:47 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-a63925d2-78ce-47f2-8178-f263625cec12
21/07/19 15:16:47 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-25cfd265-2583-4713-a05c-2c616ee2de6e
Command exiting with ret '1'
```
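Until the path selector understands globs, one possible workaround (a sketch, not a Hudi feature) is to expand the pattern yourself and drive one backfill run per resolved folder, passing each literal path as `hoodie.deltastreamer.source.dfs.root`. The folder list and helper below are hypothetical; in practice the listing would come from something like `aws s3 ls` on the parent prefix:

```python
import fnmatch

# Hypothetical listing of s3://mylake-landing/FULL/CP_TEST_TABLE/
# (yyyymmdd folder names, as in the report).
all_folders = ["20200420", "20200520", "20200521", "20200522", "20200601"]

def resolve_pattern(folders, pattern):
    """Client-side glob expansion that the DFS path selector does not perform."""
    return [f for f in folders if fnmatch.fnmatch(f, pattern)]

batches = resolve_pattern(all_folders, "202005*")
print(batches)  # ['20200520', '20200521', '20200522']

# Each match becomes one literal source root for a separate DeltaStreamer run:
roots = [f"s3://mylake-landing/FULL/CP_TEST_TABLE/{b}" for b in batches]
```

This trades one wildcard job for several literal-path jobs, which also keeps each checkpointed batch small for a large historical load.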
