brachipa opened a new issue, #30498:
URL: https://github.com/apache/beam/issues/30498
### What happened?
We have many `SqlTransform` steps in our pipeline that run SQL queries and sink
the output into an external system (Kinesis). We found that in some cases the
results don't match the SQL we provided.
For example, the simple SQL below:
```
select event as event_name, count(*) as c
from PCOLLECTION
group by event
```
returns output with fields "event" and "c", **ignoring the alias
"event_name"**.
This happens with no apparent reason, and after a simple change (e.g. removing one
column from the GROUP BY) the query stopped writing data into the "event_name" field
and wrote it to "event" instead. This caused huge data loss for us, since all our
metrics are based on event_name, which became always null.
We need to understand what causes aliases to be ignored, since we use this pattern
in most of our queries and on different fields, all in a production-critical system.
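For reference, here is a minimal sketch of how we feed the transform (the schema, field names, and values are illustrative, not our real ones):
```
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class AliasRepro {
  public static void main(String[] args) {
    // Illustrative input schema; our real PCollection has more fields.
    Schema inputSchema =
        Schema.builder()
            .addStringField("user")
            .addStringField("session")
            .addStringField("event")
            .build();

    Pipeline pipeline = Pipeline.create();

    PCollection<Row> input =
        pipeline.apply(
            Create.of(Row.withSchema(inputSchema).addValues("u1", "s1", "abc").build())
                .withRowSchema(inputSchema));

    PCollection<Row> result =
        input.apply(
            SqlTransform.query(
                "select event as event_name, count(*) as c from PCOLLECTION group by event"));

    // Expected output schema: (event_name, c); what we observe is (event, c).
    System.out.println(result.getSchema());

    pipeline.run().waitUntilFinish();
  }
}
```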
Logs for the above query:
```
2024-03-05 12:41:53.351 INFO 54818 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQL:
SELECT `PCOLLECTION`.`event` AS `event_name`, COUNT(*) AS `c`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
GROUP BY `PCOLLECTION`.`event`
2024-03-05 12:41:59.172 INFO 54818 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQLPlan>
LogicalAggregate(group=[{0}], c=[COUNT()])
LogicalProject(event_name=[$2])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
2024-03-05 12:43:06.961 INFO 54818 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : BEAMPlan>
BeamAggregationRel(group=[{2}], c=[COUNT()])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
```
But the row we get (via `row.toString()`):
```
Row:
event:"abc"
c:1
```
The pipeline steps look like:
```
.apply(SqlTransform.query(query))
.apply("to-table-row", ParDo.of(new DoFn<Row, TableRow>() {
  @ProcessElement
  public void processElement(@Element Row row, ProcessContext context) {
    System.out.println(row.toString());
    TableRow tableRow = BigQueryUtils.toTableRow(row);
    context.output(tableRow);
  }
}))
```
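As a temporary mitigation we are experimenting with forcing the field name back after the SqlTransform, e.g. with `RenameFields`. This is only a sketch, and it assumes the dropped alias always degrades to "event"; `sqlOutput` stands for the output PCollection of the SqlTransform:
```
import org.apache.beam.sdk.schemas.transforms.RenameFields;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Workaround sketch: rename the field our downstream metrics rely on.
// Note: this fails at construction time if "event" is not in the schema,
// i.e. it only helps when the alias was actually dropped.
PCollection<Row> renamed =
    sqlOutput.apply(
        "restore-alias", RenameFields.<Row>create().rename("event", "event_name"));
```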
Adding more fields to the query can sometimes help, but not always, and we have
ended up in a situation where we can't edit our queries because we don't know
what the impact will be.
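For now, the only safeguard we have found is to assert the output schema at pipeline-construction time; a minimal sketch (the list of expected field names is illustrative, and `sqlOutput` is again the output of the SqlTransform):
```
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;

// Fail fast at construction time if the SQL output schema does not
// expose the aliases our downstream steps rely on.
Schema outputSchema = sqlOutput.getSchema();
List<String> expectedFields = Arrays.asList("event_name", "c"); // illustrative
for (String field : expectedFields) {
  if (!outputSchema.hasField(field)) {
    throw new IllegalStateException(
        "SqlTransform output is missing expected field '" + field
            + "'; actual schema: " + outputSchema);
  }
}
```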
Query that works:
```
select event as event_name,a,b,c,d,e,f,g, count(*) as c
from PCOLLECTION
group by event,a,b,c,d,e,f,g
```
Logs:
```
2024-03-05 12:51:51.168 INFO 56085 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQL:
SELECT `PCOLLECTION`.`event` AS `event_name`, `PCOLLECTION`.`a`,
`PCOLLECTION`.`b`, `PCOLLECTION`.`c`, `PCOLLECTION`.`d`, `PCOLLECTION`.`e`,
`PCOLLECTION`.`f`, `PCOLLECTION`.`g`, COUNT(*) AS `c`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
GROUP BY `PCOLLECTION`.`event`, `PCOLLECTION`.`a`, `PCOLLECTION`.`b`,
`PCOLLECTION`.`c`, `PCOLLECTION`.`d`, `PCOLLECTION`.`e`, `PCOLLECTION`.`f`,
`PCOLLECTION`.`g`
2024-03-05 12:51:55.128 INFO 56085 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQLPlan>
LogicalProject(event_name=[$0], a=[$1], b=[$2], c=[$3], d=[$4], e=[$5],
f=[$6], g=[$7], c0=[$8])
LogicalAggregate(group=[{0, 1, 2, 3, 4, 5, 6, 7}], c=[COUNT()])
LogicalProject(event_name=[$2], a=[$3], b=[$4], c=[$5], d=[$6], e=[$7],
f=[$8], g=[$9])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
2024-03-05 12:52:17.640 INFO 56085 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : BEAMPlan>
BeamCalcRel(expr#0..8=[{inputs}], proj#0..8=[{exprs}])
BeamAggregationRel(group=[{2, 3, 4, 5, 6, 7, 8, 9}], c=[COUNT()])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
```
Row looks like:
```
Row:
event_name:abc
a:<null>
b:<null>
c:<null>
d:<null>
e:<null>
f:<null>
g:<null>
c0:1
```
The event_name column is now correct, but why is the count field named c0 and not c?
(Presumably because the projected column `c` and the alias `c` collide, so the planner
renames one of them to `c0`.)
Also, it is not simply about the number of GROUP BY fields: if I run the exact same
query with our real fields (not a, b, c, ...), it again stops emitting event_name and
emits event instead.
Can you explain this strange bug?
### Issue Priority
Priority: 1 (data loss / total loss of function)
### Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner