GitHub user yanbohappy opened a pull request:
https://github.com/apache/spark/pull/3827
HiveTableScan return mutable row with copy
https://issues.apache.org/jira/browse/SPARK-4963
SchemaRDD.sample() returns wrong results because GapSamplingIterator
operates on a mutable row.
HiveTableScan builds its RDD with SpecificMutableRow, and SchemaRDD.sample()
uses a GapSamplingIterator to iterate over it. Its next() method is:
    override def next(): T = {
      val r = data.next()
      advance
      r
    }
GapSamplingIterator.next() takes the current element from the underlying
iterator and assigns it to r.
However, if the underlying iterator yields a reused mutable row, as
HiveTableScan does, then r and the iterator's current row refer to the same
object. The advance call then skips some underlying elements, which also
overwrites r, so next() returns a value different from the one r initially
held.
The most direct fix is to make HiveTableScan return a copy of each mutable
row, as in my initial commit. With this solution HiveTableScan no longer gets
the full benefit of the reusable MutableRow, but sample() returns correct
results.
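The aliasing problem and the copy-based fix can be reproduced outside Spark
with a small sketch. Everything below is a hypothetical stand-in:
`MutableRow`, `ReusingIterator`, and `SkippingIterator` are illustration-only
names, not Spark classes, and the sampler keeps only the shape of the next()
method quoted above (it skips a fixed number of elements instead of a random
gap).

```scala
// Hypothetical stand-in for a reusable mutable row (not SpecificMutableRow).
final class MutableRow(var value: Int) {
  def copy(): MutableRow = new MutableRow(value)
}

// Reuses one row object for every element, as a scan over a reusable
// mutable row would.
class ReusingIterator(values: Seq[Int]) extends Iterator[MutableRow] {
  private val row = new MutableRow(0)
  private val it = values.iterator
  override def hasNext: Boolean = it.hasNext
  override def next(): MutableRow = { row.value = it.next(); row }
}

// Simplified sampler: returns one element, then skips `gap` elements.
class SkippingIterator[T](data: Iterator[T], gap: Int) extends Iterator[T] {
  private def advance(): Unit = {
    var i = 0
    while (i < gap && data.hasNext) { data.next(); i += 1 }
  }
  override def hasNext: Boolean = data.hasNext
  override def next(): T = {
    val r = data.next()
    advance() // if rows are reused, this overwrites the object r refers to
    r
  }
}

object AliasingDemo {
  private def sampled(rows: Iterator[MutableRow]): List[Int] = {
    val it = new SkippingIterator(rows, gap = 1)
    it.map(_.value).toList
  }

  // Rows are shared: each returned row has already been overwritten.
  val buggy: List[Int] = sampled(new ReusingIterator(1 to 6))

  // Rows are copied before sampling: each returned row keeps its value.
  val fixed: List[Int] = sampled(new ReusingIterator(1 to 6).map(_.copy()))
}
```

With the input 1 to 6 and a gap of 1, `AliasingDemo.buggy` is List(2, 4, 6)
while `AliasingDemo.fixed` is the expected List(1, 3, 5). The cost of the fix
is one extra allocation per row, including rows the sampler later drops.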
Furthermore, we should investigate whether GapSamplingIterator.next() could
perform the copy itself. That would require every element type an RDD can
store to implement something like Cloneable, which would be a huge change.
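One way to picture the copy-inside-next() alternative is a sampler that
demands evidence that its element type can be copied. This is only a sketch
under assumptions: `Copyable`, `Cell`, and `CopyingSkipIterator` are
hypothetical names that do not exist in Spark, and the fixed gap again
replaces the real random gap.

```scala
// Hypothetical type class: evidence that a value of type T can be copied.
trait Copyable[T] { def copy(t: T): T }

// Hypothetical mutable element type with a Copyable instance.
final class Cell(var value: Int)
object Cell {
  implicit val cellCopyable: Copyable[Cell] = new Copyable[Cell] {
    def copy(t: Cell): Cell = new Cell(t.value)
  }
}

// Simplified sampler that snapshots each element before skipping ahead.
class CopyingSkipIterator[T](data: Iterator[T], gap: Int)(
    implicit c: Copyable[T]) extends Iterator[T] {
  private def advance(): Unit = {
    var i = 0
    while (i < gap && data.hasNext) { data.next(); i += 1 }
  }
  override def hasNext: Boolean = data.hasNext
  override def next(): T = {
    val r = c.copy(data.next()) // copy first, so advance() cannot change r
    advance()
    r
  }
}

object CopyInsideDemo {
  // One shared Cell is reused for every element, as in the mutable-row case.
  private val shared = new Cell(0)
  private def reusing(values: Seq[Int]): Iterator[Cell] =
    values.iterator.map { v => shared.value = v; shared }

  val sampled: List[Int] = {
    val it = new CopyingSkipIterator(reusing(1 to 6), gap = 1)
    it.map(_.value).toList
  }
}
```

Here `CopyInsideDemo.sampled` is List(1, 3, 5) even though the source reuses
a single object, and only the returned elements are copied. The sketch also
shows why the change would be large: every element type passed to the sampler
needs a `Copyable` instance, or the code does not compile.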
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanbohappy/spark spark-4963
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3827.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3827
----
commit 6eaee5e7b1b5aca7f6abd16892f8312c7d6d7917
Author: Yanbo Liang <[email protected]>
Date: 2014-12-29T09:00:44Z
HiveTableScan return mutable row with copy
----