GitHub user saucam opened a pull request:
https://github.com/apache/spark/pull/3229
SPARK-4365: Remove unnecessary filter call on records returned from parquet
library
Since the parquet library has been updated, we no longer need to filter the
records it returns for null records, as the library now skips those itself:
from
parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
    public boolean nextKeyValue() throws IOException, InterruptedException {
      boolean recordFound = false;
      while (!recordFound) {
        // no more records left
        if (current >= total) { return false; }
        try {
          checkRead();
          currentValue = recordReader.read();
          current ++;
          if (recordReader.shouldSkipCurrentRecord()) {
            // this record is being filtered via the filter2 package
            if (DEBUG) LOG.debug("skipping record");
            continue;
          }
          if (currentValue == null) {
            // only happens with FilteredRecordReader at end of block
            current = totalCountLoadedSoFar;
            if (DEBUG) LOG.debug("filtered record reader reached end of block");
            continue;
          }
          recordFound = true;
          if (DEBUG) LOG.debug("read value: " + currentValue);
        } catch (RuntimeException e) {
          throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
        }
      }
      return true;
    }
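As a rough illustration of why a caller-side null filter becomes redundant,
here is a simplified sketch of the same skip-inside-the-reader pattern. It is
not the Parquet or Spark code: SkippingReaderSketch and readAll are made-up
names, an in-memory list stands in for the file, and a null entry simply
stands in for a record the reader decides to skip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a record reader; not the Parquet or Spark API.
class SkippingReaderSketch {
    private final Iterator<String> source;
    private String currentValue;

    SkippingReaderSketch(List<String> records) {
        this.source = records.iterator();
    }

    // Advances to the next non-null record, mirroring the skip loop above.
    // Returns false once the source is exhausted.
    boolean nextKeyValue() {
        while (source.hasNext()) {
            currentValue = source.next();
            if (currentValue == null) {
                continue; // skipped inside the reader, never seen by the caller
            }
            return true;
        }
        return false;
    }

    String getCurrentValue() {
        return currentValue;
    }

    // Drains the reader without any additional null check on the caller side.
    static List<String> readAll(List<String> records) {
        SkippingReaderSketch reader = new SkippingReaderSketch(records);
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(reader.getCurrentValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(Arrays.asList("a", null, "b", null)));
    }
}
```

Because nextKeyValue() only returns true once it holds a non-null value, a
further filter for nulls over the values the caller collects would never
remove anything.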
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/saucam/spark remove_filter
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3229.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3229
----
commit 8909ae921db25971259d3c4463af7af8db4a4152
Author: Yash Datta <[email protected]>
Date: 2014-11-12T14:12:12Z
SPARK-4365: Remove unnecessary filter call on records returned from parquet
library
----