[ https://issues.apache.org/jira/browse/SPARK-40831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17619350#comment-17619350 ]
rohan commented on SPARK-40831:
--------------------------------
We use the following methods to estimate the DataFrame size and, based on
that estimate, decide how to call the repartition and coalesce functions.
public static double sourceDataFrameSize(Dataset<?> df) {
    Statistics stats = df.queryExecution().logical().stats();
    return stats.sizeInBytes().longValue();
}

public static int blockFactor(Dataset<?> df) {
    double dfSize = sourceDataFrameSize(df);
    log.info("Data Frame Size of Source File : " + dfSize);
    long defaultBlockSize = Long.parseLong(getDefaultHadoopConfigs().get("dfs.blocksize"));
    log.info("Default Optimized Block Size : " + defaultBlockSize);
    return (int) (dfSize / defaultBlockSize) == 0
            ? 1
            : (int) Math.ceil(dfSize / defaultBlockSize);
}
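
The partition-count arithmetic in blockFactor can be checked on its own, without Spark. The sketch below is only an illustration: the class name BlockFactorSketch and the two-argument signature are hypothetical, and the size is passed in as a plain long instead of being read from plan statistics. Unlike the methods above, it uses integer arithmetic and clamps the result to the int range, since a size estimate can be very large when real statistics are unavailable.

```java
/** Sketch: derive a partition count from an estimated size and a block size. */
final class BlockFactorSketch {

    private BlockFactorSketch() {}

    /**
     * Returns ceil(sizeInBytes / blockSizeBytes), never less than 1 and
     * never more than Integer.MAX_VALUE.
     *
     * Both parameters are hypothetical stand-ins for the values the
     * original methods read from plan statistics and "dfs.blocksize".
     */
    static int blockFactor(long sizeInBytes, long blockSizeBytes) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("blockSizeBytes must be positive");
        }
        if (sizeInBytes <= 0) {
            return 1; // empty or unknown input still gets one partition
        }
        // Ceiling division written so that it cannot overflow a long.
        long blocks = sizeInBytes / blockSizeBytes
                + (sizeInBytes % blockSizeBytes == 0 ? 0 : 1);
        return (int) Math.min(blocks, (long) Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MiB
        long dataSize = 300L * 1024 * 1024;    // 300 MiB
        System.out.println(blockFactor(dataSize, blockSize)); // prints 3
    }
}
```

Separately, the trace quoted below ends in UnresolvedRelation.computeStats, which suggests that queryExecution().logical() returned a plan that had not been analyzed. Reading the statistics from queryExecution().optimizedPlan() may avoid the exception, but that has not been verified against this job.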
> LogicalPlanStats is throwing an exception
> ------------------------------------------
>
> Key: SPARK-40831
> URL: https://issues.apache.org/jira/browse/SPARK-40831
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 3.2.0
> Environment: Production
> Reporter: rohan
> Priority: Major
>
> java.lang.UnsupportedOperationException
>     at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats(LogicalPlan.scala:175)
>     at org.apache.spark.sql.catalyst.plans.logical.LeafNode.computeStats$(LogicalPlan.scala:175)
>     at org.apache.spark.sql.catalyst.analysis.UnresolvedRelation.computeStats(unresolved.scala:46)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:68)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.default(SizeInBytesOnlyStatsPlanVisitor.scala:29)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:47)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:29)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:39)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:236)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitFilter(SizeInBytesOnlyStatsPlanVisitor.scala:29)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:30)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:29)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitUnaryNode(SizeInBytesOnlyStatsPlanVisitor.scala:39)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:294)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visitProject(SizeInBytesOnlyStatsPlanVisitor.scala:29)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit(LogicalPlanVisitor.scala:37)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor.visit$(LogicalPlanVisitor.scala:25)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.SizeInBytesOnlyStatsPlanVisitor$.visit(SizeInBytesOnlyStatsPlanVisitor.scala:29)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.$anonfun$stats$1(LogicalPlanStats.scala:37)
>     at scala.Option.getOrElse(Option.scala:189)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats(LogicalPlanStats.scala:33)
>     at org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats.stats$(LogicalPlanStats.scala:33)
>     at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
>     at com.modeln.datahub.util.CMnDataHubUtil.sourceDataFrameSize(CMnDataHubUtil.java:198)
>     at com.modeln.datahub.util.CMnDataHubUtil.blockFactor(CMnDataHubUtil.java:203)
>     at com.modeln.datahub.entity.CMnHiveDeltaEntityHandler.saveAsTable(CMnHiveDeltaEntityHandler.java:102)
>     at com.modeln.datahub.entity.CMnHiveDeltaEntityHandler.append(CMnHiveDeltaEntityHandler.java:44)
>     at com.modeln.datahub.pipeline.core.CMnBasePipeline.writeEntity(CMnBasePipeline.java:267)
>     at com.modeln.cdm.siso.pipelines.SisoIngestionPipeline.ingestFiles(SisoIngestionPipeline.java:308)
>     at com.modeln.cdm.siso.pipelines.SisoIngestionPipeline.process(SisoIngestionPipeline.java:117)
>     at com.modeln.datahub.pipeline.core.CMnPipelineApplication.main(CMnPipelineApplication.java:117)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:740)