gaodayue opened a new pull request #6036: use S3 as a backup storage for hdfs 
deep storage
URL: https://github.com/apache/incubator-druid/pull/6036
 
 
   This PR improves the overall availability of hdfs-deep-storage by pushing 
data to S3 when HDFS is temporarily unavailable.
   
   # Motivation
   
   In many organizations, Hadoop and HDFS are typically used for offline data 
analysis, while Druid targets online data serving. The SLA provided by HDFS 
therefore often can't meet the needs of Druid. Consequently, users of 
hdfs-deep-storage often encounter task failures when HDFS is temporarily 
unavailable. Task failures can cause data re-processing or even data loss, 
depending on whether kafka-indexing-service or tranquility is used for 
realtime ingestion.
   
   # Goal
   
   Make segment handover continue to work even if HDFS is not available.
   
   # Approach taken by this PR
   
   We leverage the S3AFileSystem provided by the HDFS client library to support 
using S3 as backup storage for HDFS. When we can't push segments or task logs 
to HDFS, we push them to S3 instead. Using S3 as a backup for HDFS increases 
the overall availability of hdfs-deep-storage.
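   
   The fallback described above can be sketched roughly as follows. This is 
only an illustration, not the code in this PR: `FallbackPushSketch`, 
`SegmentWriter`, and `pushWithFallback` are hypothetical names, the URIs are 
made up, and the real implementation writes through Hadoop's FileSystem API 
rather than a stand-in interface.

```java
import java.io.IOException;
import java.net.URI;

/** Hypothetical sketch of push-with-fallback; not the classes from this PR. */
class FallbackPushSketch {

    /** Stand-in for whatever writes a segment or task log to a location. */
    interface SegmentWriter {
        void write(URI target) throws IOException;
    }

    /**
     * Tries the primary (HDFS) location first. If that fails with an
     * IOException and the backup is enabled, retries against the backup
     * (S3A) location. Returns the URI that was actually written.
     */
    static URI pushWithFallback(SegmentWriter writer, URI primary, URI backup,
                                boolean useS3Backup) throws IOException {
        try {
            writer.write(primary);
            return primary;
        } catch (IOException primaryFailure) {
            if (!useS3Backup) {
                throw primaryFailure;
            }
            try {
                writer.write(backup);
                return backup;
            } catch (IOException backupFailure) {
                // Keep the original HDFS error visible for debugging.
                backupFailure.addSuppressed(primaryFailure);
                throw backupFailure;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Illustrative locations only.
        URI hdfs = URI.create("hdfs://namenode/druid/segments/example");
        URI s3 = URI.create("s3a://example-bucket/druid/segments/example");
        // Simulates an HDFS outage: any write to an hdfs:// URI fails.
        SegmentWriter hdfsDown = target -> {
            if ("hdfs".equals(target.getScheme())) {
                throw new IOException("HDFS unavailable");
            }
        };
        System.out.println(pushWithFallback(hdfsDown, hdfs, s3, true));
    }
}
```

   Returning the location that was actually written lets the caller record it, 
which matters because the segment's load path has to point at wherever the 
data ended up.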
   
   For segments pushed to S3, the loadSpec becomes `{"type":"hdfs", 
"path":"s3a://..."}`. Since file access goes through the FileSystem 
abstraction, there is no need to change HdfsDataSegmentPuller.
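   
   To illustrate why the puller needs no change: Hadoop's FileSystem API 
selects an implementation from the scheme of the path URI, so a loadSpec of 
type `hdfs` can carry either an `hdfs://` or an `s3a://` path. The sketch below 
only shows the shape of the loadSpec and the scheme lookup. `LoadSpecSketch` 
and its methods are hypothetical, and the paths are invented for the example.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; the paths are illustrative, not taken from the PR. */
class LoadSpecSketch {

    /** Builds a loadSpec that keeps type "hdfs" whatever the path scheme is. */
    static Map<String, String> hdfsLoadSpec(String path) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("type", "hdfs");
        spec.put("path", path);
        return spec;
    }

    /** Returns the URI scheme that decides which file system serves the path. */
    static String schemeOf(Map<String, String> loadSpec) {
        return URI.create(loadSpec.get("path")).getScheme();
    }

    public static void main(String[] args) {
        Map<String, String> onHdfs =
            hdfsLoadSpec("hdfs://namenode/druid/segments/example/index.zip");
        Map<String, String> onS3 =
            hdfsLoadSpec("s3a://example-bucket/druid/segments/example/index.zip");
        // Same loadSpec type, different backing store.
        System.out.println(onHdfs.get("type") + " via " + schemeOf(onHdfs));
        System.out.println(onS3.get("type") + " via " + schemeOf(onS3));
    }
}
```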
   
   The following new configuration knobs are added to hdfs-deep-storage and 
hdfs task logs; please refer to the doc changes for details:
   * druid.storage.useS3Backup
   * druid.storage.backupS3Bucket
   * druid.storage.backupS3BaseKey
   * druid.indexer.logs.useS3Backup
   * druid.indexer.logs.backupS3Bucket
   * druid.indexer.logs.backupS3BaseKey
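   
   As a rough illustration of how these knobs might be set, the sketch below 
parses a properties snippet with `java.util.Properties`. The property names are 
the ones listed above. The values, the assumption that `useS3Backup` is a 
boolean, and the `BackupConfigSketch` class are all hypothetical; the doc 
changes in the PR are the authoritative reference.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch; values below are examples, not recommended settings. */
class BackupConfigSketch {

    /** Example runtime properties; bucket and key names are made up. */
    static final String EXAMPLE = String.join("\n",
        "druid.storage.useS3Backup=true",
        "druid.storage.backupS3Bucket=example-backup-bucket",
        "druid.storage.backupS3BaseKey=druid/segments",
        "druid.indexer.logs.useS3Backup=true",
        "druid.indexer.logs.backupS3Bucket=example-backup-bucket",
        "druid.indexer.logs.backupS3BaseKey=druid/task-logs");

    /** Parses properties text into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(EXAMPLE);
        // Assumes the flag is a boolean; absent or non-"true" means disabled.
        boolean segmentsBackup =
            Boolean.parseBoolean(props.getProperty("druid.storage.useS3Backup"));
        System.out.println("segment backup enabled: " + segmentsBackup);
        System.out.println("segment backup bucket: "
            + props.getProperty("druid.storage.backupS3Bucket"));
    }
}
```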
   
   Besides what's included in this PR, I've also implemented a tool called 
`restore-hdfs-segment` to migrate segments temporarily pushed to S3 back to 
HDFS. This frees up space in S3 and ensures that all segments eventually 
reside on HDFS. If you like the idea, I can send another PR for the tool later.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
