GitHub user dujunling opened a pull request:
https://github.com/apache/spark/pull/22232
[SPARK-25237][SQL] Remove updateBytesReadWithFileSize because we use Hadoop
FileSystem statistics to update the inputMetrics
## What changes were proposed in this pull request?
In FileScanRDD, we update the inputMetrics's bytesRead via updateBytesRead
every 1000 rows and when the iterator is closed.
However, when the iterator is closed we also invoke
updateBytesReadWithFileSize, which increases the inputMetrics's bytesRead by
the file's full length.
As a result, the inputMetrics's bytesRead is wrong for queries that stop
reading early, such as `select * from table limit 1`.
Since Hadoop 2.5 and earlier are no longer supported, we always obtain
bytesRead from the Hadoop FileSystem statistics rather than from the file's
length, so updateBytesReadWithFileSize can be removed.
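To make the double-counting concrete, here is a minimal sketch (illustrative only, not Spark's actual Scala code) that models the metric flow described above: filesystem statistics grow as rows are read, bytesRead is refreshed from them every 1000 rows and at close, and the removed updateBytesReadWithFileSize step additionally added the whole file length. The names `scan`, `InputMetrics`, and `add_file_size` are hypothetical stand-ins for the FileScanRDD internals.

```python
# Illustrative model of the bug: `add_file_size=True` mimics the behavior
# before this patch (close() adds the file length on top of the statistics),
# `add_file_size=False` mimics the behavior after it.

class InputMetrics:
    def __init__(self):
        self.bytes_read = 0

def scan(rows_read, file_length, bytes_per_row, add_file_size):
    metrics = InputMetrics()
    stats_bytes = 0  # analogue of Hadoop FileSystem statistics
    for i in range(rows_read):
        stats_bytes += bytes_per_row          # statistics grow as data is read
        if (i + 1) % 1000 == 0:
            metrics.bytes_read = stats_bytes  # updateBytesRead every 1000 rows
    # close(): refreshing from statistics alone is correct...
    metrics.bytes_read = stats_bytes
    if add_file_size:
        # ...but updateBytesReadWithFileSize also added the file's length,
        # over-counting for early-terminating queries like LIMIT 1.
        metrics.bytes_read += file_length
    return metrics.bytes_read

# "select * from table limit 1" analogue: 1 row (100 bytes) of a
# 1,000,000-byte file.
buggy = scan(1, 1_000_000, 100, add_file_size=True)    # reports 1_000_100
fixed = scan(1, 1_000_000, 100, add_file_size=False)   # reports 100
```

With the file-length addition removed, bytesRead reflects only what the filesystem statistics actually observed, which is correct for both full scans and early-terminating queries.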
## How was this patch tested?
Manual test.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dujunling/spark fileScanRddInput
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22232.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22232
----
commit 0f75257b50a611e069d406da8d72225bb4e73b51
Author: dujunling <dujunling@...>
Date: 2018-08-25T06:20:35Z
remove updateBytesReadWithFileSize because we use Hadoop FileSystem
statistics to update the inputMetrics
----