[ https://issues.apache.org/jira/browse/SPARK-23247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-23247: ------------------------------------ Assignee: Apache Spark > combines Unsafe operations and statistics operations in Scan Data Source > ------------------------------------------------------------------------ > > Key: SPARK-23247 > URL: https://issues.apache.org/jira/browse/SPARK-23247 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.0 > Reporter: caoxuewen > Assignee: Apache Spark > Priority: Major > > Currently, we scan the execution plan of the data source, first the unsafe > operation of each row of data, and then re traverse the data for the count of > rows. In terms of performance, this is not necessary. this PR combines the > two operations and makes statistics on the number of rows while performing > the unsafe operation. > *Before modified,* > {color:#cc7832}val {color}unsafeRow = rdd.mapPartitionsWithIndexInternal { > (index{color:#cc7832}, {color}iter) => > {color:#cc7832}val {color}proj = > UnsafeProjection.create({color:#9876aa}schema{color}) > proj.initialize(index) > {color:#FF0000}iter.map(proj){color} > } > {color:#cc7832}val {color}numOutputRows = > longMetric({color:#6a8759}"numOutputRows"{color}) > unsafeRow.map { r => > {color:#FF0000}numOutputRows += {color}{color:#6897bb}{color:#FF0000}1{color} > {color} r > } > *After modified,* > val numOutputRows = longMetric("numOutputRows") > rdd.mapPartitionsWithIndexInternal { (index, iter) => > val proj = UnsafeProjection.create(schema) > proj.initialize(index) > iter.map( r => { > {color:#FF0000} numOutputRows += 1{color} > {color:#FF0000} proj(r){color} > }) > } > -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org