realno commented on issue #1486: URL: https://github.com/apache/arrow-datafusion/issues/1486#issuecomment-1016800036
> @realno you can take this with a grain of salt as I am new to this.
>
> My thinking is that I would prefer to see the exact median implementation before having an approximate one (i.e. the approximate would be an add-on feature). I could be wrong but I believe datafusion had `DISTINCT` before `approx_distinct`.
>
> Regarding the implementation - I thought that we would be able to use existing arrow compute kernels for this and not have to re-implement existing functionality:
>
> * sort: https://docs.rs/arrow/latest/arrow/compute/kernels/sort/fn.sort.html
> * length: https://docs.rs/arrow/latest/arrow/array/trait.Array.html#method.len
> * value: https://docs.rs/arrow/latest/arrow/array/struct.PrimitiveArray.html#method.value
>
> I suppose this would be somewhere between your Option 1 and Option 2.
>
> I definitely defer to @alamb though.

Thanks for the comments @matthewmturner. I am also new and wouldn't call myself a database internals expert :)

Yes, we have all the functionality ready; the complication is finding the best/most efficient way to implement this. I definitely want to hear more opinions on this. Do you think it is worth having an approximation to unblock the perf benchmark work?
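
For reference, the sort / length / value idea from the quoted comment could look roughly like the sketch below. It is only an illustration under a few assumptions: it uses a plain Rust slice of `Option<f64>` in place of an arrow array (with `None` standing in for nulls), the function name `exact_median` is hypothetical, and it is not how datafusion implements or would implement the aggregate. The three commented steps correspond to the `sort` kernel, `Array::len`, and `PrimitiveArray::value` linked above.

```rust
/// Exact median of optional values; `None` entries stand in for nulls.
/// Returns `None` when no usable values remain.
/// NOTE: illustrative only; `exact_median` is a hypothetical helper,
/// not a datafusion or arrow API.
fn exact_median(values: &[Option<f64>]) -> Option<f64> {
    // Drop nulls, and NaNs as well so they cannot skew the middle element.
    let mut sorted: Vec<f64> = values
        .iter()
        .flatten()
        .copied()
        .filter(|v| !v.is_nan())
        .collect();

    // Step 1: sort (the role of the arrow `sort` kernel).
    sorted.sort_by(|a, b| a.total_cmp(b));

    // Step 2: length (the role of `Array::len`).
    let n = sorted.len();
    if n == 0 {
        return None;
    }

    // Step 3: read the middle value(s) (the role of `PrimitiveArray::value`).
    let mid = n / 2;
    if n % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle values.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

Calling `exact_median(&[Some(3.0), None, Some(1.0), Some(4.0), Some(2.0)])` returns `Some(2.5)`. The catch, and the reason the efficiency question matters, is that this needs every value of the group in memory and sorted before it can answer, which is the cost an approximate version would avoid.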
