attilapiros commented on pull request #30788:
URL: https://github.com/apache/spark/pull/30788#issuecomment-764926711


   I checked this again and I would like to suggest some more changes. 
   
   1) Change the PR description and explain the problem more detailed (see 
below).
   
   2) As in `FixedLengthRowBasedKeyValueBatch` the `vlen` and `klen` are 
already stored as member variables in the constructor we could use those values 
to validate the added records' lengths. So I would add  an `assert` (or at 
least WARN/ERROR log) to `FixedLengthRowBasedKeyValueBatch#appendRow` method to 
check the incoming `vlen` and `klen` by comparing them with the lengths stored 
as member variables.
   
   # What changes were proposed in this pull request?
   
   The `RowBasedKeyValueBatch` has two different implementations depending on 
whether the aggregation key and value uses only fixed length data types 
(`FixedLengthRowBasedKeyValueBatch`) or not 
(`VariableLengthRowBasedKeyValueBatch`). 
   
   Before this PR the decision about the used implementation was based on by 
accessing the schema fields by their name.
   But if two fields has the same name and one with variable length and the 
other with fixed length type (and all the other fields are with fixed length 
types) a bad decision could be made. 
   
   When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable 
length field then an aggregation function could calculate with invalid values. 
This case is illustrated by the example used in the unit test:
   
   ```
   with T as (select id as a, -id as x from range(3)), 
           U as (select id as b, cast(id as string) as x from range(3)) 
   select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by 
U.x, T.x
   ```
   
    where the 'x' column in the left side of the join is a Long but on the 
right side is a String.
   
   --- 
   
   cc @cloud-fan what do you think (especially about 2nd point)?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to