gaodayue commented on issue #6036: use S3 as a backup storage for hdfs deep 
storage
URL: https://github.com/apache/incubator-druid/pull/6036#issuecomment-407288499
 
 
   Hi @jihoonson , thanks for your comments. Answering your questions below.
   
   > I think, if this is the case, you might need to somehow increase write 
throughput of your HDFS or use a separate deep storage. If the first option is 
not available for you, does it make sense to use only S3 as your deep storage?
   
   Our company operates its own large Hadoop cluster (>5k nodes) for us to use. Switching entirely to s3-deep-storage would incur extra cost and is not an option for us.
   
   > Maybe we need to define the concept of backup deep storage for all deep 
storage types and support it. 
   
   At first I thought about implementing something like a composite-deep-storage that could add backup ability to every deep storage, but I found it non-trivial to load multiple deep storage extensions inside composite-deep-storage. So I decided to support hdfs-deep-storage only, simply because that's what we're using.
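   For illustration, the fallback behavior I have in mind can be sketched like this (hypothetical interface and class names, not Druid's actual `DataSegmentPusher` API): try the primary store first, and push to the backup only when the primary fails.

```java
// Sketch of the backup idea with hypothetical types (not Druid's real API):
// push a segment to the primary deep storage (e.g. HDFS), and fall back to
// the backup storage (e.g. S3) only when the primary push fails.
interface SegmentPusher {
    // Returns the storage location of the pushed segment.
    String push(String segmentId);
}

class BackupAwarePusher implements SegmentPusher {
    private final SegmentPusher primary; // e.g. hdfs-deep-storage
    private final SegmentPusher backup;  // e.g. the S3 backup storage

    BackupAwarePusher(SegmentPusher primary, SegmentPusher backup) {
        this.primary = primary;
        this.backup = backup;
    }

    @Override
    public String push(String segmentId) {
        try {
            return primary.push(segmentId);
        } catch (RuntimeException primaryFailure) {
            // Primary (HDFS) is unavailable; push to the backup instead.
            // A tool like restore-hdfs-segment can copy such segments back later.
            return backup.push(segmentId);
        }
    }
}
```

   The key point is that the backup store is only touched on failure, so in normal operation nothing changes and no extra S3 cost is incurred.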
   
   > Maybe the primary deep storage and backup deep storage should be in sync 
automatically.
   
   What do you mean by "in sync"? Do you mean that all segments pushed to the backup storage should eventually be copied back to the primary storage? If that's the case, I don't think there is a strong need for it (explained below).
   
   > But, this PR is restricted to support it for only HDFS deep storage and 
looks to require another tool, called restore-hdfs-segment, to keep all 
segments to reside in HDFS. This would need additional operations which make 
Druid operation difficult.
   
   First, the restore-hdfs-segment tool is not required to achieve HDFS fault tolerance; I developed it for other reasons. One is to pay less for S3, and the other is that we occasionally need to migrate a datasource from one cluster to another, and we want all segments to reside on HDFS so that we can simply use the insert-segment-to-db tool to migrate them. Users who don't share these concerns can simply ignore restore-hdfs-segment.
   
   Second, concerning operational complexity, I think it's just a trade-off between availability and cost. And the extra operational cost is as low as running restore-hdfs-segment manually after an HDFS failure, or setting up a daily crontab to run it.
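   For what it's worth, the daily crontab I mentioned could look something like this (the install path and log location are illustrative assumptions, not the tool's actual CLI):

```shell
# Hypothetical crontab entry (path and log file are assumptions):
# run restore-hdfs-segment nightly so that any segments pushed to S3
# during an HDFS outage get copied back to HDFS.
0 3 * * * /opt/druid/bin/restore-hdfs-segment >> /var/log/druid/restore-hdfs-segment.log 2>&1
```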
   
   > Kafka indexing service guarantees exactly-once data ingestion, and thus 
data loss is never expected to happen. If deep storage is not available, all 
attempts to publish segments would fail and every task should restart from the 
same offset when publishing failed. 
   
   Yeah, I'm aware of that. But for other reasons we are still using Tranquility as our main ingestion tool, and HDFS failures have caused data loss several times, which has been a big pain for us. We added this feature to solve that problem, and I think it may be useful for other people as well.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]
