n4j commented on issue #11422:
URL: https://github.com/apache/druid/issues/11422#issuecomment-877109433
A bit of context: I work with @bharadwajrembar, and we are experiencing this
issue in production.
We pull all rows for a given time range (typically a day) via `ScanQuery`
with `array` as the response format.
We have noticed that the connection sometimes closes before all rows for the
day have been fetched. Since the close is "graceful", no exception is thrown
while parsing the response, and we end up with an incomplete, inconsistent
data pull.
We see this behaviour on a datasource with about 200 columns and roughly
3 million rows per day (we have 365 days' worth of data); our lighter
datasources are unaffected.
To validate our hypothesis I wrote the two scripts below: one scans the
datasource and counts the rows returned, then compares that against the total
row count obtained via `count(1)`.
```bash
#!/bin/bash
# Driver script: launches N parallel runs of sql.sh
# (N is expected to be set in the environment)
for (( i=1; i<=N; i++ ))
do
    nohup sh -c "time ./sql.sh" > "log_$i.log" &
done
sleep 3
tail -f log_*.log | grep Consumed
```
```bash
#!/bin/bash
# Queries the datasource, counts the rows returned, and compares that count
# against the number of rows actually present.
LINES=$(curl --location --request POST 'druid-router.io/druid/v2/sql' \
--header 'Content-Type: application/json' \
--data-raw '
{
    "query": "select col1, col2, col3, col4, col200 from \"large-datasource\" where __time in (?)",
    "resultFormat": "arrayLines",
    "header": "false",
    "parameters":
    [
        {
            "type": "TIMESTAMP",
            "value": "2021-05-01"
        }
    ]
}' | wc -l)
EXPECTED_LINES=$(curl --location --request POST 'druid-router.io/druid/v2/sql' \
--header 'Content-Type: application/json' \
--data-raw '
{
    "query": "select count(1) from \"large-datasource\" where __time in (?)",
    "resultFormat": "arrayLines",
    "header": "false",
    "parameters":
    [
        {
            "type": "TIMESTAMP",
            "value": "2021-05-01"
        }
    ]
}' | tail -n 2)
echo "Consumed [$LINES] rows out of [$EXPECTED_LINES] rows for time range 2021-05-01"
```
Below is the output of the above script
```
nohup: redirecting stderr to stdout
nohup: redirecting stderr to stdout
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
========
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
**Consumed [3435173] rows out of [[3696999]] rows for time range 2021-05-01**
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
**Consumed [3435173] rows out of [[3696999]] rows for time range 2021-05-01**
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
**Consumed [3435173] rows out of [[3696999]] rows for time range 2021-05-01**
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
Consumed [3697000] rows out of [[3696999]] rows for time range 2021-05-01
```
As you can see above, we sometimes receive fewer rows than are present when
using `ScanQuery`, and since Druid terminates the connection gracefully, we
have no way of knowing whether we actually read all the rows.
We have checked our logs and tweaked all the timeout parameters, but the root
cause of this issue remains elusive.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]