n4j commented on issue #11422:
URL: https://github.com/apache/druid/issues/11422#issuecomment-877109433


   A bit of context: I work with @bharadwajrembar and we are experiencing this 
issue in production.
   
   We pull all the rows for a given time range (typically a day) via 
`ScanQuery` with `array` as the response format.
   
   What we have noticed is that the connection closes before all the rows for a 
given day are fetched, and since the close is "graceful", no exception is thrown 
while parsing the response, so we end up with an inconsistent data pull.
   
   We have noticed the above behaviour for a datasource that has about 200 
columns and roughly 3 million rows per day (we have 365 days' worth of data); 
for our other, less heavy datasources it's not an issue.
   
   To validate our hypothesis, I wrote the two scripts below: one counts the 
number of rows scanned from the datasource and compares it against the total 
number of rows obtained via `count(1)`.
   
   ```bash
   #!/bin/bash
   # Driver script: launches N concurrent runs of sql.sh and tails their logs.
   # N must be set to the desired number of concurrent queries, e.g. N=15.
   
   for (( i=1; i<=N; i++ ))
   do
       nohup sh -c "time ./sql.sh" > log_$i.log &
   done
   
   sleep 3
   tail -f log_*.log | grep Consumed
   ```
   ```bash
   #!/bin/bash
   # Executes a query against the datasource, counts the rows returned,
   # and compares the count against the number of rows actually present.
   
   LINES=$(curl --location --request POST 'druid-router.io/druid/v2/sql' \
   --header 'Content-Type: application/json' \
   --data-raw '
   {
       "query": "select  col1, col2, col3, col4, col200   from    
\"large-datasource\" where  __time in (?)",
       "resultFormat": "arrayLines",
       "header": "false",
       "parameters":
       [
           {
               "type": "TIMESTAMP",
               "value": "2021-05-01"
           }
       ]
   }' | wc -l)
   
   EXPECTED_LINES=$(curl --location --request POST 
'druid-router.io/druid/v2/sql' \
   --header 'Content-Type: application/json' \
   --data-raw '
   {
       "query": "select  count(1) from    \"large-datasource\" where __time in 
(?)",
       "resultFormat": "arrayLines",
       "header": "false",
       "parameters":
       [
           {
               "type": "TIMESTAMP",
               "value": "2021-05-01"
           }
       ]
   }' | tail -n 2)
   
   echo "Consumed [$LINES] rows  out of [$EXPECTED_LINES] rows for time range 
2021-05-01"
   ```
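   A note on the counts below: `wc -l` on the scan output reports one line more 
than `count(1)` on a complete pull (3697000 vs. 3696999), which I believe is the 
blank line Druid appends at the end of a line-oriented result set. Assuming 
that, a hypothetical helper (not part of the scripts above) could flag an 
incomplete pull automatically instead of eyeballing the logs:
   
   ```shell
   #!/bin/bash
   # Hypothetical helper: a complete arrayLines pull should show
   # expected_rows + 1 lines (rows plus the blank terminator line).
   # Anything else is treated as a truncated response.
   check_pull() {
       local consumed=$1 expected=$2
       if [ "$consumed" -ne $(( expected + 1 )) ]; then
           echo "INCOMPLETE: consumed $consumed lines, expected $(( expected + 1 ))"
           return 1
       fi
       echo "OK: all $expected rows received"
   }
   ```
   
   In our case, `check_pull "$LINES" "$EXPECTED"` at the end of sql.sh would 
make a bad run exit non-zero so the driver can retry it.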
   
   Below is the output of the above script
   
   ```
   nohup: redirecting stderr to stdout
   nohup: redirecting stderr to stdout
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   
   ========
   
   
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   **Consumed [3435173] rows  out of [[3696999]] rows for time range 
2021-05-01**
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   **Consumed [3435173] rows  out of [[3696999]] rows for time range 
2021-05-01**
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   **Consumed [3435173] rows  out of [[3696999]] rows for time range 
2021-05-01**
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   Consumed [3697000] rows  out of [[3696999]] rows for time range 2021-05-01
   ```
   
   As you can see from the above, we sometimes receive fewer rows than are 
present via `ScanQuery`, and since Druid terminates the connection gracefully, 
we have no way of knowing whether we were able to read all the rows.
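   One possible client-side guard, if I read the Druid SQL API docs correctly: 
the line-oriented result formats (`arrayLines`, `objectLines`) terminate a 
complete result set with a single blank line precisely so that truncation can 
be detected. A minimal sketch of that check (the file path is a stand-in; in 
sql.sh the curl output would be written to a file first):
   
   ```shell
   #!/bin/bash
   # Sketch: treat a saved arrayLines response as complete only if its
   # last line is the blank end-of-results marker Druid appends.
   response_complete() {
       [ -z "$(tail -n 1 "$1")" ]
   }
   ```
   
   If the check fails, the pull can be retried rather than silently accepted as 
complete.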
   
   We have checked our logs and tweaked all the timeout parameters, but the 
root cause of this issue remains elusive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
