zeroshade commented on issue #44135:
URL: https://github.com/apache/arrow/issues/44135#issuecomment-2353801596

   Most likely, consumers of this service wouldn't be retrieving the entire PB 
of data for each request, right? They'd be requesting some subset of the data, 
correct?
   
   Thus the Flight service would simply be a go-between that centralizes access 
and the logic for retrieving the data from S3, while potentially providing a 
cache. Each service instance would never load the entire dataset into memory; it 
would just stream portions of it, possibly caching bits and pieces to avoid 
calling all the way out to S3.
   
   > Secondly, when we have incremental updates or new data arrives, how do we 
refresh the data?
   
   You'd have to perform cache invalidation for any caches in the services. The 
simplest way might be to do a Stat check on the relevant object in S3 and use 
the cache only if the "last modified" timestamp matches the cached one; when the 
object has been updated and the local cache is out of date, pull it directly 
from S3.
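   A sketch of that invalidation scheme, assuming a cheap `stat` call (e.g. S3 
`HeadObject`) that returns the object's last-modified timestamp. The 
`cachedObject`/`fetch`/`getObject` names are all illustrative, not from the 
issue or any Arrow API.

   ```go
   package main
   
   import (
   	"fmt"
   	"time"
   )
   
   // cachedObject pairs locally cached bytes with the last-modified timestamp
   // observed when they were fetched from S3.
   type cachedObject struct {
   	data         []byte
   	lastModified time.Time
   }
   
   // fetch serves from the cache while the remote last-modified timestamp (from a
   // cheap Stat/HeadObject call) still matches; otherwise it re-downloads and
   // refreshes the cache. getObject stands in for the S3 download.
   func fetch(cache map[string]cachedObject, key string,
   	stat func(string) time.Time, getObject func(string) []byte) []byte {
   
   	remoteStamp := stat(key)
   	if c, ok := cache[key]; ok && c.lastModified.Equal(remoteStamp) {
   		return c.data // cache hit: object unchanged since we fetched it
   	}
   	data := getObject(key) // cache miss or stale: pull from S3
   	cache[key] = cachedObject{data: data, lastModified: remoteStamp}
   	return data
   }
   
   func main() {
   	stamp := time.Unix(100, 0)
   	downloads := 0
   	stat := func(string) time.Time { return stamp }
   	get := func(string) []byte { downloads++; return []byte("v1") }
   
   	cache := map[string]cachedObject{}
   	fetch(cache, "dataset.parquet", stat, get) // miss: downloads from "S3"
   	fetch(cache, "dataset.parquet", stat, get) // hit: served locally
   	fmt.Println("downloads:", downloads) // prints "downloads: 1"
   
   	stamp = stamp.Add(time.Minute)             // object updated in S3
   	fetch(cache, "dataset.parquet", stat, get) // stale: downloads again
   	fmt.Println("downloads:", downloads) // prints "downloads: 2"
   }
   ```

   Comparing timestamps (or, equivalently, ETags) keeps the per-request cost to 
one metadata round trip instead of a full re-download.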
   
   > Thirdly, we are planning to deploy onto multiple nodes in a k8s cluster, so 
every node will maintain a copy of the data. I am concerned about such a huge 
amount of data loaded in memory, and about the ideal way to sync the data.
   
   Same comment as before: you wouldn't pull the entire dataset as a copy into 
each service instance. That would be incredibly inefficient and unnecessary. The 
ideal approach here is, as I mentioned before, to stream directly from S3 while 
optionally caching the data locally if needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
