jgutmann commented on issue #5588:
URL: 
https://github.com/apache/incubator-pinot/issues/5588#issuecomment-649098852


   +1 on what @mcvsubbu stated above. After going through and using the tool 
myself, here are some additional thoughts:
   
   
   
   1. It might be interesting if we could point the tool at a stream and have 
the tool consume its own segment from that stream. 
   
   If the table config contains the stream configs, the tool should be able to 
use them to start up a pinot-server instance and consume. (Perhaps this is 
harder than I estimate, though.) 
   
   One caveat here is that traffic on the stream might vary, so consuming 
starting from the highest offset (i.e., consuming events only as they become 
available) might not give reliable results. If we run the tool in this "live 
consumption" mode during a period where the event stream has abnormally low 
traffic, the estimation won't be representative. 
   
   If we could consume from the smallest offset (i.e., consume historical 
data), we could see how the table would perform over a longer sample period and 
gather more data. Additionally, this would allow the tool to be run immediately 
and process the historical data, rather than having it consume for a few hours 
to accumulate enough "new" events (as with consuming from the largest offset). 
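   As a sketch of what reusing the table's stream configs might look like, the 
tool could override the offset-reset criterion to start from the smallest 
offset. (The property names below follow Pinot's Kafka stream config 
convention; whether the tool honors exactly these keys is an assumption, and 
`myTopic` is a placeholder.)

   ```json
   {
     "streamConfigs": {
       "streamType": "kafka",
       "stream.kafka.topic.name": "myTopic",
       "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
     }
   }
   ```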
   
   
   
   2. What if we could build some kind of recommendation engine into 
pinot-server itself? 
   
   We could create the table in pinot-server using "off-the-shelf" default 
options. Every so often (every few hours, at segment close, etc.), pinot-server 
could analyze itself and output a matrix similar to what this tool outputs, 
either to the existing logs or to a new log. Operationally, this would allow us 
to create a table, then come back after a day or two, check the logs, and have 
a recommendation waiting for us. 
   
   This could also open the door to expanding the auto-tuning functionality. 
Pinot would know how many instances are present and could auto-tune for that 
instance count. If we could output a metric indicating that we are not in an 
"optimal zone" for that number of segments, we could act on it by auto-scaling 
the number of instances up or down. After scaling instances up or down, the 
segment sizing could auto-tune to adapt to the new instance count (perhaps 
gated behind a cluster-level config to enable this feature). 
   
   
   
   3. Provide better command-line options to make the tool more intuitive. 
   
   Using an offline segment for estimation was kind of clunky: specify the 
segment, then figure out the number of rows in the segment, divide that by the 
event rate, and pass the result of this division as the time period for the 
segment. 
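   The manual math described above can be sketched as follows. (This is only an 
illustration of the arithmetic, not the tool's actual flags; the function name 
and the example numbers are made up.)

   ```python
   def hours_covered_by_segment(num_rows: int, events_per_second: float) -> float:
       """Derive the time-period argument: how many hours of traffic
       the offline segment represents at the assumed event rate."""
       seconds = num_rows / events_per_second
       return seconds / 3600.0

   # e.g. a 10M-row segment at an average rate of 1000 events/sec
   # represents roughly 2.78 hours of traffic
   period_hours = hours_covered_by_segment(10_000_000, 1000.0)
   print(round(period_hours, 2))
   ```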
   
   We could potentially add a flag indicating that an offline segment is being 
passed, and allowing the "Average Event Rate" to be passed as an argument might 
simplify the manual steps for the user. (This reduces the level of 
understanding, and the miscellaneous math, needed to figure out what the 
arguments should be.) 
   
   Ideally, I shouldn't need to know much about how the tool works; I'd just 
pass a few trivial arguments (e.g., here's a segment and X) and the tool would 
interpolate the needed data from there. 
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


