brachipa opened a new issue, #30498:
URL: https://github.com/apache/beam/issues/30498
### What happened?
We have many `SqlTransform` steps in our pipeline that run SQL queries and sink
the output into an external system (Kinesis). We found that in some cases the
results don't match the SQL we provided.
For example, the simple SQL below:
```
select event as event_name, count(*) as c
from PCOLLECTION
group by event
```
returns output with fields "event" and "c", **ignoring the alias
"event_name"**.
This happens with no apparent reason, and after a simple change (e.g. removing one
column from the GROUP BY) the query stopped writing data into the "event_name" field
and wrote it to "event" instead. This caused huge data loss for us, since all our
metrics are based on event_name, which became always null.
We need to understand what causes aliases to be ignored, since we use this pattern
in most of our queries and on different fields, all in a production-critical system.
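For reference, here is a minimal sketch of how we feed the transform (the schema, field names, and values are illustrative, not our real ones):
```
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class AliasRepro {
  public static void main(String[] args) {
    // Illustrative input schema; our real PCollection has more fields.
    Schema inputSchema =
        Schema.builder()
            .addStringField("user")
            .addStringField("session")
            .addStringField("event")
            .build();

    Pipeline pipeline = Pipeline.create();

    PCollection<Row> input =
        pipeline.apply(
            Create.of(Row.withSchema(inputSchema).addValues("u1", "s1", "abc").build())
                .withRowSchema(inputSchema));

    PCollection<Row> result =
        input.apply(
            SqlTransform.query(
                "select event as event_name, count(*) as c from PCOLLECTION group by event"));

    // Expected output schema: (event_name, c); what we observe is (event, c).
    System.out.println(result.getSchema());

    pipeline.run().waitUntilFinish();
  }
}
```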
Logs for the above query:
```
2024-03-05 12:41:53.351 INFO 54818 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQL:
SELECT `PCOLLECTION`.`event` AS `event_name`, COUNT(*) AS `c`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
GROUP BY `PCOLLECTION`.`event`
2024-03-05 12:41:59.172 INFO 54818 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQLPlan>
LogicalAggregate(group=[{0}], c=[COUNT()])
LogicalProject(event_name=[$2])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
2024-03-05 12:43:06.961 INFO 54818 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : BEAMPlan>
BeamAggregationRel(group=[{2}], c=[COUNT()])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
```
But the row we get (via `row.toString()`):
```
Row:
event:"abc"
c:1
```
The pipeline steps look like:
```
.apply(SqlTransform.query(query))
.apply("to-table-row", ParDo.of(new DoFn<Row, TableRow>() {
  @ProcessElement
  public void processElement(@Element Row row, ProcessContext context) {
    System.out.println(row.toString());
    TableRow tableRow = BigQueryUtils.toTableRow(row);
    context.output(tableRow);
  }
}))
```
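As a temporary mitigation we are experimenting with forcing the field name back after the SqlTransform, e.g. with `RenameFields`. This is only a sketch, and it assumes the dropped alias always degrades to "event"; `sqlOutput` stands for the output PCollection of the SqlTransform:
```
import org.apache.beam.sdk.schemas.transforms.RenameFields;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

// Workaround sketch: rename the field our downstream metrics rely on.
// Note: this fails at construction time if "event" is not in the schema,
// i.e. it only helps when the alias was actually dropped.
PCollection<Row> renamed =
    sqlOutput.apply(
        "restore-alias", RenameFields.<Row>create().rename("event", "event_name"));
```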
Adding more fields to the query can sometimes help, but not always, and we have
ended up in a situation where we can't edit our queries because we don't know
what the impact will be.
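For now, the only safeguard we have found is to assert the output schema at pipeline-construction time; a minimal sketch (the list of expected field names is illustrative, and `sqlOutput` is again the output of the SqlTransform):
```
import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.schemas.Schema;

// Fail fast at construction time if the SQL output schema does not
// expose the aliases our downstream steps rely on.
Schema outputSchema = sqlOutput.getSchema();
List<String> expectedFields = Arrays.asList("event_name", "c"); // illustrative
for (String field : expectedFields) {
  if (!outputSchema.hasField(field)) {
    throw new IllegalStateException(
        "SqlTransform output is missing expected field '" + field
            + "'; actual schema: " + outputSchema);
  }
}
```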
Query that works:
```
select event as event_name,a,b,c,d,e,f,g, count(*) as c
from PCOLLECTION
group by event,a,b,c,d,e,f,g
```
Logs:
```
2024-03-05 12:51:51.168 INFO 56085 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQL:
SELECT `PCOLLECTION`.`event` AS `event_name`, `PCOLLECTION`.`a`,
`PCOLLECTION`.`b`, `PCOLLECTION`.`c`, `PCOLLECTION`.`d`, `PCOLLECTION`.`e`,
`PCOLLECTION`.`f`, `PCOLLECTION`.`g`, COUNT(*) AS `c`
FROM `beam`.`PCOLLECTION` AS `PCOLLECTION`
GROUP BY `PCOLLECTION`.`event`, `PCOLLECTION`.`a`, `PCOLLECTION`.`b`,
`PCOLLECTION`.`c`, `PCOLLECTION`.`d`, `PCOLLECTION`.`e`, `PCOLLECTION`.`f`,
`PCOLLECTION`.`g`
2024-03-05 12:51:55.128 INFO 56085 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : SQLPlan>
LogicalProject(event_name=[$0], a=[$1], b=[$2], c=[$3], d=[$4], e=[$5],
f=[$6], g=[$7], c0=[$8])
LogicalAggregate(group=[{0, 1, 2, 3, 4, 5, 6, 7}], c=[COUNT()])
LogicalProject(event_name=[$2], a=[$3], b=[$4], c=[$5], d=[$6], e=[$7],
f=[$8], g=[$9])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
2024-03-05 12:52:17.640 INFO 56085 --- [o-auto-1-exec-1]
o.a.b.s.e.sql.impl.CalciteQueryPlanner : BEAMPlan>
BeamCalcRel(expr#0..8=[{inputs}], proj#0..8=[{exprs}])
BeamAggregationRel(group=[{2, 3, 4, 5, 6, 7, 8, 9}], c=[COUNT()])
BeamIOSourceRel(table=[[beam, PCOLLECTION]])
```
Row looks like:
```
Row:
event_name:abc
a:<null>
b:<null>
c:<null>
d:<null>
e:<null>
f:<null>
g:<null>
c0:1
```
The event_name column is now correct, but why is the count field named c0 and not c?
(Presumably because the projected column `c` and the alias `c` collide, so the planner
renames one of them to `c0`.)
Also, it is not simply about the number of GROUP BY fields: if I run the exact same
query with our real fields (not a, b, c, ...), it again stops emitting event_name and
emits event instead.
Can you explain this strange bug?
### Issue Priority
Priority: 1 (data loss / total loss of function)
### Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner