jgutmann commented on issue #5588: URL: https://github.com/apache/incubator-pinot/issues/5588#issuecomment-649098852
+1 on what @mcvsubbu stated above. After going through and using the tool myself, here are some additional thoughts:

1. It might be interesting if we could point the tool at a stream and have it consume its own segment from the stream. If the table config contains the stream configs, the tool should be able to use them to start up a pinot-server instance and consume (perhaps this is harder than I estimate, though). One caveat: traffic on the stream may vary, so starting consumption from the highest offset (i.e., consuming events only as they become available) might not give reliable results. If we run the tool in this "live consumption" mode during a period when the event stream has abnormally low traffic, the estimate won't be representative. If we could instead consume from the smallest offset (i.e., consume historical data), we could observe performance over a longer sample period and gather more data. This would also let the tool run immediately against historical data rather than consuming for a few hours to accumulate enough "new" events (as with consuming from the largest offset).

2. What if we built some kind of recommendation engine into pinot-server itself? We could create the table in pinot-server using "off-the-shelf" default options. Every so often (every few hours, at segment close, etc.), pinot-server could analyze itself and output a matrix similar to what this tool outputs, either in the current logs, in a new log, or elsewhere. Operationally, this would let us create a table, come back after a day or two to check the logs, and have a recommendation waiting for us. This could also open the door to expanding the auto-tuning functionality: Pinot would know how many instances are present and could auto-tune for that instance count.
    If we could also output a metric indicating when we are not in an "optimal zone" for the current number of segments, we could act on that metric by auto-scaling the number of instances up or down. After scaling, segment sizing could auto-tune to adapt to the new instance count (perhaps gated behind a cluster-level config to enable this feature).

3. Provide better command-line options to make the tool more intuitive. Using an offline segment for estimation is currently clunky: specify the segment, figure out the number of rows in the segment, divide by the event rate, then pass the result of that division as the time period for the segment. We could let the user declare that they are passing an offline segment, and accept the "Average Event Rate" as an argument, which would eliminate the manual steps (and reduce the level of understanding and miscellaneous math needed to figure out what the arguments should be). Ideally I shouldn't need to know much about how the tool works; I'd just pass a few trivial arguments (e.g., here's a segment and the event rate) and the tool could interpolate the data it needs from there.
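To illustrate the offset choice in point 1, here is a minimal Python sketch of building the stream configs for the tool's hypothetical "live consumption" mode. The property names mirror the Kafka-style keys used in Pinot `streamConfigs`, but the function itself (`build_stream_configs`) is an illustrative assumption, not an existing API:

```python
# Sketch: choosing the starting offset for a hypothetical "live consumption"
# mode. Keys mirror Kafka-style consumer properties as seen in Pinot
# streamConfigs; the helper itself is illustrative, not an existing API.

def build_stream_configs(topic: str, broker: str,
                         consume_historical: bool) -> dict:
    """Build a stream config map. 'smallest' replays historical data so the
    estimator sees a longer, more representative sample than 'largest'."""
    return {
        "streamType": "kafka",
        "stream.kafka.topic.name": topic,
        "stream.kafka.broker.list": broker,
        # 'smallest' = start from the earliest retained offset (historical),
        # 'largest'  = consume only newly arriving events.
        "stream.kafka.consumer.prop.auto.offset.reset":
            "smallest" if consume_historical else "largest",
    }

configs = build_stream_configs("events", "broker:9092",
                               consume_historical=True)
print(configs["stream.kafka.consumer.prop.auto.offset.reset"])  # smallest
```

Consuming from `smallest` is what would let the tool process a day or more of historical traffic up front instead of waiting hours for new events.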
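The "optimal zone" metric from point 2 could look something like the following sketch: a server periodically compares its segments-per-instance ratio against a target band and emits a signal an autoscaler could act on. Every name and threshold here is an illustrative assumption, not existing Pinot behavior:

```python
# Hypothetical "optimal zone" check for point 2. The band boundaries and the
# function name are made up for illustration; a real implementation would
# derive them from the same analysis the sizing tool performs.

def scaling_signal(total_segments: int, num_instances: int,
                   min_per_instance: int = 10,
                   max_per_instance: int = 50) -> int:
    """Return +1 (scale up), 0 (in the optimal zone), or -1 (scale down)."""
    per_instance = total_segments / num_instances
    if per_instance > max_per_instance:
        return 1   # too many segments per server: add instances
    if per_instance < min_per_instance:
        return -1  # servers underutilized: remove instances
    return 0

print(scaling_signal(600, 10))  # 1 -> 60 segments/instance, above the band
```

Emitting just this one tri-state metric would be enough for a generic autoscaler to act on, after which segment sizing could re-tune for the new instance count.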
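The manual division described in point 3 is trivial for the tool to do itself. A sketch of the arithmetic, assuming the user supplies the segment row count and an average event rate (the function name is hypothetical):

```python
# Sketch of the manual math from point 3: given an offline segment's row
# count and the stream's average event rate, derive the time period the
# segment represents, instead of making the user divide by hand.

def derived_time_period_hours(num_rows: int,
                              avg_events_per_hour: float) -> float:
    """Time span an offline segment represents at the given event rate."""
    if avg_events_per_hour <= 0:
        raise ValueError("event rate must be positive")
    return num_rows / avg_events_per_hour

# e.g. a 36M-row segment at 3M events/hour represents ~12 hours of data
print(derived_time_period_hours(36_000_000, 3_000_000))  # 12.0
```

With this, the user passes the segment and the event rate, and the tool interpolates the time period itself.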
