[ https://issues.apache.org/jira/browse/SPARK-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208068#comment-14208068 ]
Apache Spark commented on SPARK-4365:
-------------------------------------

User 'saucam' has created a pull request for this issue:
https://github.com/apache/spark/pull/3229

> Remove unnecessary filter call on records returned from parquet library
> -----------------------------------------------------------------------
>
>                 Key: SPARK-4365
>                 URL: https://issues.apache.org/jira/browse/SPARK-4365
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Yash Datta
>            Priority: Minor
>             Fix For: 1.2.0
>
>
> Since the parquet library has been updated, we no longer need to filter the
> records it returns for null records, as the library now skips those itself.
> From
> parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java:
>
> public boolean nextKeyValue() throws IOException, InterruptedException {
>   boolean recordFound = false;
>   while (!recordFound) {
>     // no more records left
>     if (current >= total) { return false; }
>     try {
>       checkRead();
>       currentValue = recordReader.read();
>       current++;
>       if (recordReader.shouldSkipCurrentRecord()) {
>         // this record is being filtered via the filter2 package
>         if (DEBUG) LOG.debug("skipping record");
>         continue;
>       }
>       if (currentValue == null) {
>         // only happens with FilteredRecordReader at end of block
>         current = totalCountLoadedSoFar;
>         if (DEBUG) LOG.debug("filtered record reader reached end of block");
>         continue;
>       }
>       recordFound = true;
>       if (DEBUG) LOG.debug("read value: " + currentValue);
>     } catch (RuntimeException e) {
>       throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
>     }
>   }
>   return true;
> }

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
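The point of the quoted loop can be illustrated with a minimal, self-contained sketch (all class and method names here are hypothetical, not Parquet's actual API): a reader whose nextKeyValue() already skips filtered (null) records internally never surfaces a null to its caller, so an extra downstream null filter is redundant.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch: mimics the skip-inside-the-reader behaviour of
// InternalParquetRecordReader.nextKeyValue(), where null marks a
// filtered-out record that must never reach the caller.
public class SkippingReaderDemo {
    static class SkippingReader {
        private final Iterator<String> source;
        private String currentValue;

        SkippingReader(Iterator<String> source) {
            this.source = source;
        }

        // Keep reading until a non-filtered record is found or the
        // input is exhausted, just like the quoted while-loop.
        boolean nextKeyValue() {
            while (source.hasNext()) {
                String v = source.next();
                if (v == null) {
                    continue; // record was filtered out; skip it here
                }
                currentValue = v;
                return true;
            }
            return false;
        }

        String getCurrentValue() {
            return currentValue;
        }
    }

    public static void main(String[] args) {
        SkippingReader reader = new SkippingReader(
                Arrays.asList("a", null, "b", null, null, "c").iterator());
        int seen = 0;
        while (reader.nextKeyValue()) {
            // No caller-side null filter needed: the reader never
            // surfaces a filtered record.
            if (reader.getCurrentValue() == null) {
                throw new AssertionError("reader surfaced a filtered record");
            }
            seen++;
            System.out.println(reader.getCurrentValue());
        }
        System.out.println("records=" + seen);
    }
}
```

Running this prints only the three non-null records, which is why the Spark-side filter call the issue targets no longer does any work.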