attilapiros commented on pull request #30788:
URL: https://github.com/apache/spark/pull/30788#issuecomment-764926711
I checked this again and I would like to suggest some more changes.
1) Change the PR description and explain the problem more detailed (see
below).
2) As in `FixedLengthRowBasedKeyValueBatch` the `vlen` and `klen` are
already stored as member variables in the constructor we could use those values
to validate the added records' lengths. So I would add an `assert` (or at
least WARN/ERROR log) to `FixedLengthRowBasedKeyValueBatch#appendRow` method to
check the incoming `vlen` and `klen` by comparing them with the lengths stored
as member variables.
# What changes were proposed in this pull request?
The `RowBasedKeyValueBatch` has two different implementations depending on
whether the aggregation key and value uses only fixed length data types
(`FixedLengthRowBasedKeyValueBatch`) or not
(`VariableLengthRowBasedKeyValueBatch`).
Before this PR the decision about the used implementation was based on by
accessing the schema fields by their name.
But if two fields has the same name and one with variable length and the
other with fixed length type (and all the other fields are with fixed length
types) a bad decision could be made.
When `FixedLengthRowBasedKeyValueBatch` is chosen but there is a variable
length field then an aggregation function could calculate with invalid values.
This case is illustrated by the example used in the unit test:
```
with T as (select id as a, -id as x from range(3)),
U as (select id as b, cast(id as string) as x from range(3))
select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by
U.x, T.x
```
where the 'x' column in the left side of the join is a Long but on the
right side is a String.
---
cc @cloud-fan what do you think (especially about 2nd point)?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]