[GitHub] [incubator-pinot] reallocf commented on issue #5268: Filter during ingestion

GitBox Thu, 23 Apr 2020 13:23:44 -0700


reallocf commented on issue #5268:
URL: 
https://github.com/apache/incubator-pinot/issues/5268#issuecomment-618646220



   @npawar - thanks for the feedback! Sorry for the delayed response - got 
pulled into a new project at work and was really heads down for the past few 
days. Let me know if there's any time expectations you'd like to set here - 
right now I'm scoping for the next couple of weeks (I'm more or less just 
working off-work hours) but if y'all need this done more quickly then let me 
know and I'll prioritize for it or bow out and let somebody else take it.
   
   My thoughts:
   I'm thinking supporting two kinds of filtering might make the most sense: 
preFilter and postFilter. Pre-filtering is nice because it means, if the source 
row doesn't fit the filter, it can remove it without having to calculate any of 
the derived values. Post-filtering is nice because it allows means that any 
logic embedded in one of the columns doesn't have to be duplicated when 
calculating it for the filter (like in your example, you could just put the 
`maxBid` logic in a preFilter and then have that be a part of the conditional 
logic, but you're then duplicating that work for the actual column 
calculation). From an interface perspective, this also makes it slightly more 
sensible I think.
   
   If we made it column-specific, that makes defining filters across columns 
awkward. Maybe we don't want to support them? There might be a performance 
improvement if we have a filter on a derived column - if that column's 
calculation completes early and is filtered out, it could skip the execution of 
other derived columns - but that seems like it might be optimizing for a very 
specific use case. We could also add this later if we think it'll be useful in 
the future.
   
   For top-level filters being a column, someone _could_ just throw everything 
into a single `Groovy({column1 + column2 + ... + columnN > x}, column1, 
column2, ..., columnN)`. For multiple different conditions one could use `&&`. 
There might be a bit more clarity with having it be a list of multiple 
conditionals, but we might take a slight performance hit when executing the 
additional Groovy scripts.
   
   Yeah, my vote is definitely for supporting expressions with multiple 
columns. But, I don't know the internals of the code really at all yet - do you 
think just focusing on a column-based approach would make more sense from an 
implementation perspective?
   
   Read through the doc - very nice! :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-pinot] reallocf commented on issue #5268: Filter during ingestion

Reply via email to