ozankabak commented on PR #3570:
URL: 
https://github.com/apache/arrow-datafusion/pull/3570#issuecomment-1260839211

   Hello again! We have resolved almost all the points you mentioned. 
Additionally, we identified some challenging corner cases regarding NULL 
treatment and updated our code to handle them, adding new unit tests to cover 
these cases. Three main discussion points remain:
   
   *PostgreSQL compatibility of new unit tests*
   
   Expected values of all the new unit tests were sanity-checked against 
PostgreSQL. To stay in sync with PostgreSQL, we also think it could be a good 
idea to add psql-parity tests for the NULL-treatment cases. If you agree, 
let's make this the subject of another PR. We can open an issue to track it 
(and will be happy to work on a PR to resolve it).
   
   *Changes to `ScalarValue`*
   
   At this point, we have not added any new functionality or complexity to 
this type. We simply tidied up and moved some `ScalarValue`-only code that was 
scattered across multiple modules into `scalar.rs`. For example, the 
coercions you saw were already part of the code here:
   
   
https://github.com/apache/arrow-datafusion/blob/87faf868f2276b84e63cad6721ca08bd79ed9cb8/datafusion/physical-expr/src/aggregate/sum.rs#L261
   
   We agree that, in general, there should be no coercions at this stage -- 
any necessary conversions should happen earlier in the code path. However, we 
don't want to extend the scope of this PR to removing already-existing 
coercions from the codebase. If there is no existing issue tracking this, we 
are happy to open one and help resolve it in the near future, along with 
similar issues like 
[#3511](https://github.com/apache/arrow-datafusion/issues/3511).
   
   A teaser: I have done some preliminary tests, and it seems that removing 
all the coercions will not be a tough task. However, removing this kind of 
code always requires more careful testing than a preliminary check, and it 
deserves a careful study in a separate PR 🙂
   
   *Bisect*
   
   We considered using the 
[`partition_point`](https://doc.rust-lang.org/std/primitive.slice.html#method.partition_point)
 function. However, this would require converting `&[ArrayRef]` to 
`&[&[ScalarValue]]`, where each inner `&[ScalarValue]` corresponds to a row 
(because we are searching for the insertion point of a target 
`&[ScalarValue]` among rows). Unless we are missing something, doing so would 
incur unnecessary memory usage and CPU time, since it involves copying every 
row. In contrast, our implementation takes the data in its original columnar 
format and materializes a row from an index only when a comparison is 
performed. Since binary search makes *log(n)* comparisons, we need only 
*log(n)* row copies instead of *n*. If you have any suggestions for adding 
bisect functionality without changing the columnar orientation of the data, 
we can create an issue for this and help resolve it in a subsequent PR.
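   To make the idea concrete, here is a minimal standalone sketch of the 
columnar bisect described above. It is not the PR's actual implementation: 
plain `Vec<i64>` columns stand in for `ArrayRef`, and a materialized `Vec<i64>` 
stands in for a row of `ScalarValue`s. The point it illustrates is that a row 
is built from the columns only at each comparison, so a search over *n* rows 
copies only *O(log n)* rows out of the columnar data.

   ```rust
   // Stand-in for extracting row `idx` from columnar data; in DataFusion
   // this would build a Vec<ScalarValue> from &[ArrayRef] at a given index.
   fn row_at(columns: &[Vec<i64>], idx: usize) -> Vec<i64> {
       columns.iter().map(|col| col[idx]).collect()
   }

   /// Returns the insertion point for `target` among the rows of `columns`,
   /// assuming rows are sorted lexicographically in ascending order.
   /// Rows are materialized lazily, only at the O(log n) probe points.
   fn bisect_left(columns: &[Vec<i64>], target: &[i64]) -> usize {
       let n = columns.first().map_or(0, |c| c.len());
       let (mut lo, mut hi) = (0, n);
       while lo < hi {
           let mid = lo + (hi - lo) / 2;
           // The row is copied out of the columns only here, at comparison time.
           if row_at(columns, mid).as_slice() < target {
               lo = mid + 1;
           } else {
               hi = mid;
           }
       }
       lo
   }

   fn main() {
       // Two columns, five rows: (1,10), (1,20), (2,5), (2,7), (3,0)
       let columns = vec![vec![1, 1, 2, 2, 3], vec![10, 20, 5, 7, 0]];
       assert_eq!(bisect_left(&columns, &[2, 6]), 3);
       assert_eq!(bisect_left(&columns, &[0, 0]), 0);
       assert_eq!(bisect_left(&columns, &[4, 0]), 5);
       println!("ok");
   }
   ```

   Converting to `&[&[ScalarValue]]` up front would instead copy all *n* rows 
before `partition_point` could even run, which is the overhead we want to avoid.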
   
   Thanks again for all the reviews. Excited to see this merged and used!

