WTa-hash opened a new issue #4241: URL: https://github.com/apache/hudi/issues/4241
Are there any tips on, or support for, setting up a Disaster Recovery (DR) environment with Apache Hudi? We are building our data lake, stored on AWS S3, by running a Spark structured streaming application on AWS EMR. The Spark application processes incoming data from an AWS Kinesis stream, saves it as Hudi tables on S3, and syncs them with the AWS Glue catalog (a rough sketch of the writer is included below). All of this happens in a single AWS region (us-east-1).

In the event that we need to fail over to a different region, or our main region (us-east-1) goes down, what is the suggested approach for starting up again in another AWS region with our existing data lake data? We can set up S3 replication to copy the parquet files (and `.hoodie` files) to a bucket in a different AWS region, but S3 replication happens asynchronously, which means files may be replicated out of order and cause issues when querying the replica, due to possibly missing files (see the second sketch below).

**Environment Description**

* Hudi version : 0.7.0-amzn-1
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : 2.10.1
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
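For context, here is a simplified sketch of what our writer does. The stream name, schema, table name, and bucket paths below are placeholders, not our real configuration:

```python
# Hypothetical, simplified version of our EMR writer: Kinesis -> Hudi on S3,
# with Hive sync going through the Glue-backed metastore (use_jdbc = false).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("hudi-kinesis-writer").getOrCreate()

schema = StructType([
    StructField("record_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("payload", StringType()),
])

# Read from Kinesis via the spark-sql-kinesis connector shipped on EMR.
raw = (spark.readStream.format("kinesis")
       .option("streamName", "my-stream")                       # placeholder
       .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
       .option("startingposition", "TRIM_HORIZON")
       .load())

# Kinesis records arrive as binary in the `data` column; parse to columns.
events = (raw.select(from_json(col("data").cast("string"), schema).alias("e"))
          .select("e.*")
          .withColumn("event_date", to_date(col("event_ts"))))

# Continuously upsert into a Hudi table on S3 and sync it to the Glue catalog.
query = (events.writeStream.format("hudi")
         .option("hoodie.table.name", "my_table")
         .option("hoodie.datasource.write.recordkey.field", "record_id")
         .option("hoodie.datasource.write.precombine.field", "event_ts")
         .option("hoodie.datasource.write.partitionpath.field", "event_date")
         .option("hoodie.datasource.hive_sync.enable", "true")
         .option("hoodie.datasource.hive_sync.database", "my_db")
         .option("hoodie.datasource.hive_sync.table", "my_table")
         .option("hoodie.datasource.hive_sync.partition_fields", "event_date")
         .option("hoodie.datasource.hive_sync.partition_extractor_class",
                 "org.apache.hudi.hive.MultiPartKeysValueExtractor")
         .option("hoodie.datasource.hive_sync.use_jdbc", "false")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/my_table")
         .outputMode("append")
         .start("s3://my-bucket/datalake/my_table"))
query.awaitTermination()
```

And to make the out-of-order concern concrete, this is roughly the kind of check we would have to run before pointing queries at the replica: a plain boto3 diff of the two buckets' object listings. Again, the bucket names and table prefix are placeholders, and of course this only works while the primary region is still reachable:

```python
# Crude replication-lag check: which objects under the table prefix exist in
# the primary bucket but have not yet been replicated to the replica bucket?
import boto3

def list_keys(client, bucket, prefix):
    """Collect every object key under `prefix` in `bucket`."""
    keys = set()
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys

primary = boto3.client("s3", region_name="us-east-1")
replica = boto3.client("s3", region_name="us-west-2")   # placeholder DR region

prefix = "datalake/my_table/"                           # placeholder
missing = (list_keys(primary, "primary-bucket", prefix)
           - list_keys(replica, "replica-bucket", prefix))

# The dangerous case is a commit file under .hoodie/ arriving before the data
# files it references: readers of the replica would then look for files that
# are not there yet.
for key in sorted(missing):
    print("not yet replicated:", key)
```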
