Github user AiHe commented on the pull request:
https://github.com/apache/spark/pull/5587#issuecomment-94509911
@JoshRosen
I am following the NAStatCounter example from the book "Advanced
Analytics with Spark". NAStatCounter computes summary statistics for a dataset
in which some values are missing.
The book's version is written in Scala; I reimplemented it in Python.
My code follows.
```python
from pyspark.statcounter import StatCounter


class NAStatCounter(object):
    """Tracks summary statistics plus a count of missing values."""

    def __init__(self):
        self.stats = StatCounter()
        self.missing = 0

    def add(self, x):
        if not x:
            # empty field: count it as missing
            self.missing += 1
        elif isinstance(x, NAStatCounter):
            # merge the state of another counter
            self.stats.mergeStats(x.stats)
            self.missing += x.missing
        else:
            self.stats.merge(float(str(x)))
        return self

    def __str__(self):
        return 'stats: ' + str(self.stats) + ' missing: ' + str(self.missing)

    def __repr__(self):
        return self.__str__()
```
Here I make up a dummy dataset and then compute the stats.
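```python
# sc is the SparkContext provided by the pyspark shell
rdd = sc.parallelize(['1,,2', ',3,1', '1,2,'])
na_stat = rdd.map(lambda x: x.split(','))
# zeroValue: one fresh NAStatCounter per column
z = [NAStatCounter() for i in range(3)]
# x is expected to be the list of counters, y the split row
op = lambda x, y: [a.add(b) for a, b in zip(x, y)]
result = na_stat.fold(z, op)
```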
Then I get an error like "'str' object has no attribute 'add'", because the
"fold" implementation calls op('1', NAStatCounter()). With the lambda function
above, that becomes '1'.add(NAStatCounter()), whereas the expected call is
NAStatCounter().add('1').
As you mentioned, only the first argument may be modified. I think that first
argument should be the provided "zeroValue", and each element should be passed
as the second argument, which must not be changed.
Intuitively, users write the lambda function with "x" as the "zeroValue" and
"y" as the element.
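The difference between the two call orders can be modeled without a cluster.
The sketch below is plain Python and is not PySpark's actual "fold" code:
`MissingCounter`, `merge_rows`, `fold_element_first` and
`fold_accumulator_first` are hypothetical names, and the counter is a
simplified stand-in for NAStatCounter that only tracks counts.

```python
class MissingCounter(object):
    """Hypothetical stand-in for NAStatCounter: counts values and blanks."""

    def __init__(self):
        self.count = 0
        self.missing = 0

    def add(self, x):
        if isinstance(x, MissingCounter):
            # merge the state of another counter
            self.count += x.count
            self.missing += x.missing
        elif not x:
            # empty field: count it as missing
            self.missing += 1
        else:
            self.count += 1
        return self


def merge_rows(x, y):
    # Written the intuitive way: x holds the counters, y is the split row.
    return [a.add(b) for a, b in zip(x, y)]


def fold_element_first(rows, zero_value, op):
    """Models the call order reported above: op(element, accumulator)."""
    acc = zero_value
    for row in rows:
        acc = op(row, acc)
    return acc


def fold_accumulator_first(rows, zero_value, op):
    """Models the expected call order: op(accumulator, element)."""
    acc = zero_value
    for row in rows:
        acc = op(acc, row)
    return acc


rows = [line.split(',') for line in ['1,,2', ',3,1', '1,2,']]

# Element-first order ends up calling '1'.add(counter), which fails.
try:
    fold_element_first(rows, [MissingCounter() for _ in range(3)], merge_rows)
    error = None
except AttributeError as exc:
    error = str(exc)

# Accumulator-first order calls counter.add('1'), as the lambda intends.
result = fold_accumulator_first(
    rows, [MissingCounter() for _ in range(3)], merge_rows)
summary = [(c.count, c.missing) for c in result]
```

With the accumulator passed first, each of the three columns ends up with two
values and one missing entry; with the element passed first, the same
`merge_rows` fails on the first row.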