[GitHub] [arrow-datafusion] hengfeiyang opened a new issue, #6173: SQL with distinct will give wrong field name

via GitHub Sun, 30 Apr 2023 19:54:17 -0700


hengfeiyang opened a new issue, #6173:
URL: https://github.com/apache/arrow-datafusion/issues/6173


   ### Describe the bug
   
   With SQL:
   ```sql
   SELECT id, count(distinct name) as cnt FROM t GROUP BY id ORDER BY cnt DESC
   ```
   will got result:
   ```
   +------+-----+
   | t.id | cnt |
   +------+-----+
   | 8    | 1   |
   | 4    | 1   |
   | 2    | 1   |
   +------+-----+
   ```
   
   I expect the fields name to be `id` and `cnt`, but it give me `t.id`.
   
   With this SQL no problem:
   ```sql
   SELECT id as id, count(distinct name) as cnt FROM t GROUP BY id ORDER BY cnt 
DESC
   ```
   will got result:
   ```
   +----+-----+
   | id | cnt |
   +----+-----+
   | 8  | 1   |
   | 2  | 1   |
   | 4  | 1   |
   +----+-----+
   ```
   
   So, I am not sure if this is a bug, I want to got the field name `id`, I 
must select it like this `id as id`.
   
   
   
   
   ### To Reproduce
   
   
   the test code is:
   
   ```Rust
   use datafusion::arrow::{
       array::Int64Array, array::StringArray, datatypes::DataType, 
record_batch::RecordBatch,
   };
   use datafusion::error::Result;
   use datafusion::from_slice::FromSlice;
   use datafusion::prelude::SessionContext;
   use std::sync::Arc;
   
   // create local session context with an in-memory table
   fn create_context() -> Result<SessionContext> {
       use datafusion::arrow::datatypes::{Field, Schema};
       use datafusion::datasource::MemTable;
       // define a schema.
       let schema = Arc::new(Schema::new(vec![
           Field::new("id", DataType::Int64, false),
           Field::new("name", DataType::Utf8, false),
       ]));
   
       // define data in two partitions
       let batch1 = RecordBatch::try_new(
           schema.clone(),
           vec![
               Arc::new(Int64Array::from_slice([2, 4, 8])),
               Arc::new(StringArray::from_slice(["alex", "smish", "alex"])),
           ],
       )?;
   
       // declare a new context. In spark API, this corresponds to a new spark 
SQLsession
       let ctx = SessionContext::new();
   
       // declare a table in memory. In spark API, this corresponds to 
createDataFrame(...).
       let provider = MemTable::try_new(schema, vec![vec![batch1]])?;
       ctx.register_table("t", Arc::new(provider))?;
       Ok(ctx)
   }
   
   #[tokio::main]
   async fn main() -> Result<()> {
       let ctx = create_context()?;
       let sql = "SELECT id, count(distinct name) as cnt FROM t GROUP BY id 
ORDER BY cnt DESC";
       let df = ctx.sql(&sql).await?;
       df.show().await?;
   
       let sql = "SELECT id as id, count(distinct name) as cnt FROM t GROUP BY 
id ORDER BY cnt DESC";
       let df = ctx.sql(&sql).await?;
       df.show().await?;
       Ok(())
   }
   ```
   
   ### Expected behavior
   
   With SQL:
   ```sql
   SELECT id, count(distinct name) as cnt FROM t GROUP BY id ORDER BY cnt DESC
   ```
   Should get result:
   ```
   +------+-----+
   | id | cnt |
   +------+-----+
   | 8    | 1   |
   | 4    | 1   |
   | 2    | 1   |
   +------+-----+
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] hengfeiyang opened a new issue, #6173: SQL with distinct will give wrong field name

Reply via email to