[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172226#comment-17172226
 ] 

Yu Gan edited comment on SPARK-12741 at 8/6/20, 10:38 AM:
----------------------------------------------------------

Aha, I ran into a similar issue. My SQL is:

select
  p_brand,
  p_size,
  count(ps_suppkey) as supplier_cnt
from
  tpch.partsupp
  inner join tpch.part
    on p_partkey = ps_partkey
group by
  p_brand,
  p_size

The total row counts differ:

dataSet.count()=1179, dataSet.rdd().count()=1178

 

Finally, I found the root cause:

When a BadRecordException is thrown in 
org.apache.spark.sql.execution.datasources.FailureSafeParser#parse while in 
PermissiveMode (the default mode) because a corrupted record exists, the 
result row is a None record. That None record is then filtered out, which 
would account for rdd().count() being one lower than count().
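
To make the mechanism above concrete, here is a minimal toy model in plain 
Python. It is not Spark code: parse_permissive, count_without_parsing and 
count_with_parsing are hypothetical names, and the sample records are made up. 
It only illustrates how a count that never parses the rows and a count that 
drops None rows can disagree by exactly the number of corrupted records.

```python
from typing import Iterable, Optional, Tuple


def parse_permissive(line: str) -> Optional[Tuple[str, int]]:
    """Toy parser (hypothetical): returns None for a record it cannot parse."""
    fields = line.split(",")
    if len(fields) != 2:
        return None
    try:
        return (fields[0], int(fields[1]))
    except ValueError:
        # Corrupted record: no exception escapes, the row simply becomes None.
        return None


def count_without_parsing(lines: Iterable[str]) -> int:
    """Counts raw records without looking at their content."""
    return sum(1 for _ in lines)


def count_with_parsing(lines: Iterable[str]) -> int:
    """Parses every record and filters out the None rows before counting."""
    return sum(1 for line in lines if parse_permissive(line) is not None)


# Made-up sample data with one corrupted record (third line).
records = ["Brand#11,5", "Brand#12,7", "Brand#13,not-a-number", "Brand#14,9"]

fast_count = count_without_parsing(records)
parsed_count = count_with_parsing(records)
print(fast_count, parsed_count)
```

In this sketch the two counts differ by one because of the single corrupted 
record, which mirrors the 1179 vs 1178 difference reported above.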

BTW, this is on Spark 2.4.



> DataFrame count method return wrong size.
> -----------------------------------------
>
>                 Key: SPARK-12741
>                 URL: https://issues.apache.org/jira/browse/SPARK-12741
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Sasi
>            Priority: Major
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (previously 1.5.0). I have a DataFrame and 
> two methods, one that collects data and one that counts.
> The doQuery method looks like:
> {code}
> dataFrame.collect()
> {code}
> The doQueryCount method looks like:
> {code}
> dataFrame.count()
> {code}
> I have a few scenarios with the following results:
> 1) No data exists in my NoSQL database: count 0 and collect() 0.
> 2) 3 rows exist: count 0 and collect 3.
> 3) 5 rows exist: count 2 and collect 5.
> I tried changing the count code to the code below, but got the same results 
> as mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi



