[GitHub] [incubator-pinot] jackjlli commented on pull request #6479: Support data ingestion for generating offline segment in one pass

GitBox Mon, 01 Feb 2021 11:02:14 -0800


jackjlli commented on pull request #6479:
URL: https://github.com/apache/incubator-pinot/pull/6479#issuecomment-771083123



   > > Thanks for the details @jackjlli, could you also explain whether 
IntermediateSegment is better than the existing MutableSegment? For example, 
you could stream input data to MutableSegment and flush it as needed. This also 
solves multiple problems:
   > > 
   > > * Common code base for offline and RT segment generation (at least for 
the streaming part).
   > > * Sorting can now be done for offline within SegmentGeneration, instead 
of requiring users to do so explicitly.
   > > * Auto segment sizing that happens in RT can now also be done for 
offline.
   > > 
   > > Thoughts @jackjlli @Jackie-Jiang?
   > 
   > I think this is a good idea to explore, but I suspect memory utilization 
on the offline side may go up significantly.
   > 
   > Also, the auto-segment sizing in realtime is implemented (in the 
controller) by learning the history of segments already completed. For offline 
generation, if we can keep a history or some learning mechanism, then it may be 
possible to implement approximate segment sizing algorithms -- whether we use 
MutableSegment to build segments or not.
   
   1. Yes, memory utilization would go up significantly, which is why I didn't 
use `MutableSegment` directly and used `IntermediateSegment` as the 
intermediate container instead. Both `IntermediateSegment` and `MutableSegment` 
share the same minimal piece of logic: each has a forward index. The difference 
is that `MutableSegment` also builds all the other indices (if applicable), 
such as the inverted index and text index, for querying purposes, while 
`IntermediateSegment` keeps only the minimal components, such as the dictionary.
   2. Also, if we want partitioning or sorting, those steps can be done in the 
platform (e.g. MapReduce or Spark) before converting the raw data. In fact, we 
already have that logic at LinkedIn. Once this PR is committed, we can consider 
open sourcing that Spark code as well.
   3. Auto-segment sizing is a good idea: historical data can be used to 
predict the cardinality or buffer size. However, offline segment generation is 
not always done on the same machine, and the historical data would be 
meaningless if it cannot be reused across machines. If the historical data 
comes from the controller, then all the worker machines would have to query the 
Pinot controller simultaneously to fetch it, which could put a heavy query load 
on the controller. That's why I didn't include it in this PR. We can always add 
it to `IntermediateSegment` in the future, since the structures of 
`IntermediateSegment` and `MutableSegment` are pretty much the same.
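
To make point 1 concrete, here is a minimal sketch of what "dictionary plus forward index, and nothing else" can look like for a single string column. It is not the code in this PR and not Pinot's API: the class `DictionaryEncodedColumn` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT Pinot's IntermediateSegment.
// It keeps just a dictionary and a forward index for one string column,
// with no inverted index, text index, or any other query-time structure.
final class DictionaryEncodedColumn {
    // Dictionary: distinct value -> dictionary id, and the reverse lookup.
    private final Map<String, Integer> valueToDictId = new HashMap<>();
    private final List<String> dictIdToValue = new ArrayList<>();
    // Forward index: document id (row number) -> dictionary id.
    private final List<Integer> forwardIndex = new ArrayList<>();

    // Appends one row, assigning a new dictionary id on first sight of a value.
    void add(String value) {
        Integer dictId = valueToDictId.get(value);
        if (dictId == null) {
            dictId = dictIdToValue.size();
            valueToDictId.put(value, dictId);
            dictIdToValue.add(value);
        }
        forwardIndex.add(dictId);
    }

    // Reads a row back through the forward index and the dictionary.
    String get(int docId) {
        return dictIdToValue.get(forwardIndex.get(docId));
    }

    int cardinality() {
        return dictIdToValue.size();
    }

    int numDocs() {
        return forwardIndex.size();
    }
}
```

Adding the rows `"us"`, `"ca"`, `"us"` stores three forward-index entries but only two dictionary entries. A container meant for querying would add further per-column structures on top of this, which is where the extra memory goes.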
   
   All 3 points above are really good features, but they would be too much for 
a single PR. It'd be good if we can leave room for those features and pick them 
up in follow-up PRs if applicable. 
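
As a rough starting point for such a follow-up, history-based sizing for offline generation could be as simple as the sketch below. It assumes each worker can obtain, by some means not decided here, the row counts and on-disk sizes of previously built segments. The names `SegmentSizeEstimator`, `SegmentStats`, and `estimateRowsPerSegment` are hypothetical, and this is not how the controller implements realtime sizing.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of approximate offline segment sizing from history.
// Not Pinot code; all names here are made up for illustration.
final class SegmentSizeEstimator {

    // Row count and on-disk size of one previously completed segment.
    static final class SegmentStats {
        final long numRows;
        final long sizeBytes;

        SegmentStats(long numRows, long sizeBytes) {
            this.numRows = numRows;
            this.sizeBytes = sizeBytes;
        }
    }

    // Estimates how many rows fit in a segment of roughly targetBytes,
    // using the average bytes-per-row observed across the history.
    static long estimateRowsPerSegment(List<SegmentStats> history, long targetBytes) {
        long totalRows = 0;
        long totalBytes = 0;
        for (SegmentStats stats : history) {
            totalRows += stats.numRows;
            totalBytes += stats.sizeBytes;
        }
        if (totalRows <= 0 || totalBytes <= 0) {
            throw new IllegalArgumentException("history must contain non-empty segments");
        }
        double bytesPerRow = (double) totalBytes / totalRows;
        return Math.max(1L, (long) (targetBytes / bytesPerRow));
    }

    public static void main(String[] args) {
        List<SegmentStats> history = Arrays.asList(
            new SegmentStats(1_000, 100_000),
            new SegmentStats(3_000, 300_000));
        // Average is 100 bytes per row, so a 200 MB target maps to 2,000,000 rows.
        System.out.println(estimateRowsPerSegment(history, 200_000_000L));
    }
}
```

The open question from point 3 remains where the history lives. Computing it once per job and passing it to the workers as a job parameter would avoid every worker querying the controller, but that is a design choice for a later PR.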




