cshannon commented on issue #4455:
URL: https://github.com/apache/accumulo/issues/4455#issuecomment-2052251333

   I would be cautious with a change like this without some 
benchmarking/testing and probably a way to configure or turn it off.
   
   Fsync operations are generally slow and can easily hurt ingest 
performance. Some systems, like Kafka, recommend against relying on fsync and 
instead rely on replication for durability. Kafka gets its performance from 
writing to the page cache and letting the OS flush to disk asynchronously; the 
durability comes from producers receiving acks from more than one system that a 
message was received. Kafka (and also ActiveMQ) supports setting a flush 
interval, so you can flush periodically on a timer to tune how often you sync.
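
   To make the flush-interval idea concrete, here is a minimal sketch in plain 
Java. It is not how Kafka or ActiveMQ implement it; `IntervalFlushWriter`, 
`flushIntervalMs`, and `syncCount` are hypothetical names, and 
`FileChannel.force(true)` stands in for whatever sync call the real system uses.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical writer: appends land in the page cache, fsync runs on a timer. */
class IntervalFlushWriter implements AutoCloseable {
  private final FileChannel channel;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicBoolean dirty = new AtomicBoolean(false);
  private final AtomicLong syncCount = new AtomicLong();

  IntervalFlushWriter(Path file, long flushIntervalMs) throws IOException {
    channel = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    // Sync at most once per interval, and only if something was written.
    scheduler.scheduleWithFixedDelay(this::syncIfDirty,
        flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
  }

  /** Writes without forcing to disk; the OS may flush whenever it likes. */
  void append(byte[] data) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(data);
    while (buf.hasRemaining()) {
      channel.write(buf);
    }
    dirty.set(true);
  }

  private void syncIfDirty() {
    if (dirty.compareAndSet(true, false)) {
      try {
        channel.force(true); // the expensive fsync-style call
        syncCount.incrementAndGet();
      } catch (IOException e) {
        dirty.set(true); // leave the data marked dirty so the next tick retries
      }
    }
  }

  /** Number of completed syncs, useful when benchmarking different intervals. */
  long syncCount() {
    return syncCount.get();
  }

  @Override
  public void close() throws IOException {
    scheduler.shutdown();
    try {
      scheduler.awaitTermination(1, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    syncIfDirty(); // one final sync so close() leaves the data on disk
    channel.close();
  }
}
```

   The trade-off is that data written since the last tick can be lost on a 
crash unless something else, such as replication, covers that window.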
   
   Hadoop is obviously a bit different, and I'm not sure what durability 
guarantees it provides through data replication without calling hsync(), but it 
is worth looking into whether the call is necessary. Ultimately, I think we 
need to research this and see whether it's really a good idea and provides 
enough benefit before we start calling hsync(). Performance may be fine if we 
only call it once per file, but it's hard to say without testing. And lastly, 
as I said, I think it would be good to make this configurable so it can be 
disabled.
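
   As a rough sketch of what "once per file, and configurable" could look 
like: the code below is not Accumulo code. `SyncableOutput` is a stand-in for 
the part of Hadoop's output stream API that exposes `hsync()`, and 
`FileFinisher`, `syncOnClose`, and `CountingOutput` are hypothetical names. A 
real change would read the flag from an Accumulo property instead of a 
constructor argument.

```java
import java.io.Closeable;
import java.io.IOException;

/** Stand-in (assumption) for a Hadoop-style output stream that supports hsync(). */
interface SyncableOutput extends Closeable {
  void hsync() throws IOException;
}

/** Hypothetical helper: calls hsync() once, right before close, if enabled. */
class FileFinisher {
  private final boolean syncOnClose;

  FileFinisher(boolean syncOnClose) {
    this.syncOnClose = syncOnClose;
  }

  void finish(SyncableOutput out) throws IOException {
    try {
      if (syncOnClose) {
        out.hsync(); // single sync per file, not per write
      }
    } finally {
      out.close(); // always close, even if the sync fails
    }
  }
}

/** Fake output that only counts calls, handy for tests and quick benchmarks. */
class CountingOutput implements SyncableOutput {
  int syncs;
  int closes;

  @Override
  public void hsync() {
    syncs++;
  }

  @Override
  public void close() {
    closes++;
  }
}
```

   With the flag off, the behavior is the same as today, which makes it easy 
to benchmark both settings against each other.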

