viirya commented on issue #4024:
URL: https://github.com/apache/arrow-datafusion/issues/4024#issuecomment-1299390821

   > ❯ create external table lineitem stored as parquet location '/mnt/bigdata/tpch/sf1-parquet/lineitem';
   > 0 rows in set. Query took 0.015 seconds.
   > 
   > ❯ select count(*) from lineitem where l_discount between 0.05 and 0.07;
   > +-----------------+
   > | COUNT(UInt8(1)) |
   > +-----------------+
   > | 16361562        |
   > +-----------------+
   > 1 row in set. Query took 0.487 seconds.
   > 
   > ❯ select count(*) from lineitem where l_discount between 0.06-0.01 and 0.06+0.01;
   > +-----------------+
   > | COUNT(UInt8(1)) |
   > +-----------------+
   > | 10908630        |
   > +-----------------+
   > 1 row in set. Query took 0.394 seconds.
   > So `between 0.05 and 0.07` is consistent between Spark and DataFusion, but `between 0.06-0.01 and 0.06+0.01` is not.
   
   This is understandable. It is because `0.06` is parsed as a double type in DataFusion but as a decimal type in Spark. When the literals are treated as doubles (whether in DataFusion or Spark), `0.06 - 0.01` evaluates to roughly `0.049999999999999996` instead of `0.05` due to floating-point precision, so the effective bounds of the `BETWEEN` shift slightly and a different set of rows matches.
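
   As a minimal sketch (plain Rust `f64` arithmetic, not DataFusion code itself), the drift in both bounds is easy to reproduce:

   ```rust
   fn main() {
       // 0.06 and 0.01 have no exact binary (f64) representation, so the
       // arithmetic does not land exactly on 0.05 / 0.07.
       let lower: f64 = 0.06 - 0.01;
       let upper: f64 = 0.06 + 0.01;
       println!("lower = {}", lower); // lower = 0.049999999999999996
       println!("upper = {}", upper); // upper = 0.06999999999999999
       assert!(lower < 0.05); // BETWEEN lower bound drifts just below 0.05
       assert!(upper < 0.07); // BETWEEN upper bound drifts just below 0.07
   }
   ```

   With decimal literals, as Spark uses here, the addition and subtraction are exact and the bounds stay at `0.05` and `0.07`; with doubles, the upper bound in particular ends up just below `0.07`, which would exclude rows whose `l_discount` is exactly `0.07`.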
   
   More interestingly, for the query `select count(*) from lineitem where l_discount between 0.05 and 0.07;`, I only got `1637557` when running the query in DataFusion against the data generated by `./tpch-gen.sh 1.0`.
   
   Assuming the query execution is correct, I suspect that the data generated by `tpch-gen.sh` is not the same as the data at `'/mnt/bigdata/tpch/sf1-parquet/lineitem'`.
   
   I also wonder whether `123140554.79` is the correct answer for the query on this data.
   

