cshannon commented on issue #4455: URL: https://github.com/apache/accumulo/issues/4455#issuecomment-2052251333
I would be cautious with a change like this without some benchmarking/testing, and probably a way to configure it or turn it off. Fsync operations are generally slow and can easily hurt ingest performance. Some systems, like Kafka, recommend against relying on fsync and rely on replication for durability instead. Kafka gets its performance from using the page cache and letting the OS flush to disk asynchronously; the durability comes from producers receiving acks from more than one broker confirming that a message was received. Kafka (and also ActiveMQ) supports setting a flush interval, so you can flush periodically on a timer to tune how often you sync.

Hadoop is obviously a bit different, and I'm not sure what durability guarantees it provides through data replication without calling hsync(), but it is worth looking into to see whether this is necessary. Ultimately, I think we need to research this and see whether it's really a good idea and provides enough benefit before we start calling hsync(). It may be that performance is fine if we only call it once per file, but it's hard to say without testing.

Lastly, as I said, I think it would be good to make this configurable so that it can be disabled.
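To make the "configurable, possibly on a timer" idea concrete, here is a minimal sketch of what such a knob could look like. It is not Accumulo or Hadoop code: the class name, the `SyncPolicy` enum, and the interval parameter are all hypothetical, and `FileChannel.force(true)` on a local file stands in for `hsync()` on an HDFS output stream. It only illustrates the three choices discussed above: never sync, sync once per file at close, or sync when a configured interval has elapsed.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: a file writer whose sync-to-disk behavior is configurable. */
class ConfigurableSyncWriter implements AutoCloseable {

  /** Hypothetical policy names; these are not real Accumulo properties. */
  public enum SyncPolicy { NONE, ON_CLOSE, INTERVAL }

  private final FileChannel channel;
  private final SyncPolicy policy;
  private final long intervalNanos;
  private long lastSyncNanos;
  private int syncCount = 0;

  ConfigurableSyncWriter(Path file, SyncPolicy policy, long intervalMillis)
      throws IOException {
    this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING);
    this.policy = policy;
    this.intervalNanos = intervalMillis * 1_000_000L;
    this.lastSyncNanos = System.nanoTime();
  }

  /** Writes a record; under INTERVAL, syncs if the interval has elapsed. */
  void write(String record) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
    while (buf.hasRemaining()) {
      channel.write(buf);
    }
    if (policy == SyncPolicy.INTERVAL
        && System.nanoTime() - lastSyncNanos >= intervalNanos) {
      sync();
    }
  }

  /** Forces data and metadata to the device (local stand-in for hsync()). */
  private void sync() throws IOException {
    channel.force(true);
    syncCount++;
    lastSyncNanos = System.nanoTime();
  }

  /** Number of syncs issued so far; useful when benchmarking a policy. */
  int syncCount() {
    return syncCount;
  }

  /** Syncs once before closing unless the policy is NONE. */
  @Override
  public void close() throws IOException {
    try {
      if (policy != SyncPolicy.NONE) {
        sync();
      }
    } finally {
      channel.close();
    }
  }
}
```

Counting syncs like this would also give a cheap way to compare policies in a benchmark, for example once-per-file versus interval-based, before deciding on a default.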
