[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172226#comment-17172226 ]
Yu Gan edited comment on SPARK-12741 at 8/6/20, 10:38 AM:
----------------------------------------------------------

Aha, I came across a similar issue. My SQL is:

select p_brand, p_size, count(ps_suppkey) as supplier_cnt
from tpch.partsupp inner join tpch.part on p_partkey = ps_partkey
group by P_BRAND, p_size

The total row counts differ: dataSet.count()=1179, dataSet.rdd().count()=1178

Finally I found the root cause: a BadRecordException is thrown in org.apache.spark.sql.execution.datasources.FailureSafeParser#parse. In PermissiveMode, when a corrupted record exists, the resulting row is a None record, and that None record is then filtered out.

BTW, the Spark version is 2.4.

> DataFrame count method return wrong size.
> -----------------------------------------
>
>                 Key: SPARK-12741
>                 URL: https://issues.apache.org/jira/browse/SPARK-12741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Sasi
>            Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (it used to be 1.5.0). I have a DataFrame and
> two methods, one to collect data and the other to count.
> Method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> Method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have a few scenarios with these results:
> 1) No data exists in my NoSQL database: count 0 and collect() 0.
> 2) 3 rows exist: count 0 and collect 3.
> 3) 5 rows exist: count 2 and collect 5.
> I tried changing the count code to the code below, but got the same results
> as mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi

--
This message was sent by Atlassian Jira
(v8.3.4#803005)