zeroshade commented on issue #44135: URL: https://github.com/apache/arrow/issues/44135#issuecomment-2353801596
Most likely, consumers of this service wouldn't be retrieving the entire PB of data for each request; they'd be requesting some subset of it. The Flight service would then simply be a go-between that centralizes access and the logic for retrieving the data from S3, while potentially providing a cache. Each service instance would never load the entire dataset into memory: it would just stream portions of it, possibly caching bits and pieces to avoid calling all the way out to S3 every time.

> Secondly when we have incremental updates or new data arrives how we refresh the data ?

You'd have to perform cache invalidation for any caches in the services. The simplest approach might be to do a Stat check on the relevant object in S3 and use the cache only if the "last modified" timestamp matches the cached one; when the object has been updated and the local cache is out of date, just pull it directly from S3 again.

> Thirdly we are planning to deploy into multiple nodes in a k8s cluster then every node will maintain a copy of data, I am concerned on such a huge data loaded in-memory and ideal way to sync data

Same comment as before: you wouldn't pull the entire dataset into each service instance as a copy. That would be incredibly inefficient and unnecessary. The ideal approach, as mentioned above, is to stream directly from S3 while optionally caching the data locally if needed.
