Github user AiHe commented on the pull request:

    https://github.com/apache/spark/pull/5587#issuecomment-94509911
  
    @JoshRosen 
    
    I am following the NAStatCounter example from the book "Advanced Analytics with Spark". NAStatCounter is meant to compute stats for a dataset in which some values are missing.
    
    The book uses Scala; I reimplemented the example in Python. My code follows.
    
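    ```python
    from pyspark.statcounter import StatCounter

    class NAStatCounter(object):
        """Tracks stats of the values seen and counts the missing ones."""

        def __init__(self):
            self.stats = StatCounter()
            self.missing = 0

        def add(self, x):
            if isinstance(x, NAStatCounter):
                # Merge another counter into this one.
                self.stats.mergeStats(x.stats)
                self.missing += x.missing
            elif x is None or x == '':
                # Empty fields count as missing (a numeric 0 does not).
                self.missing += 1
            else:
                self.stats.merge(float(x))
            return self

        def __str__(self):
            return 'stats: ' + str(self.stats) + ' missing: ' + str(self.missing)

        def __repr__(self):
            return self.__str__()
    ```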
    
    Here I make up a dummy dataset and then compute the stats.
    ```python
    rdd = sc.parallelize(['1,,2', ',3,1', '1,2,'])
    na_stat = rdd.map(lambda x: x.split(','))
    z = [NAStatCounter() for i in xrange(3)]
    op = lambda x, y: map(lambda a: a[0].add(a[1]), zip(x, y))
    result = na_stat.fold(z, op)
    ```
    
    Then I get an error like "'str' object has no attribute 'add'", because the "fold" implementation effectively calls op('1', NAStatCounter()). With the lambda above, that becomes '1'.add(NAStatCounter()), whereas the expected call is NAStatCounter().add('1').
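
To make the difference concrete without a Spark cluster, here is a minimal pure-Python sketch of folding a single partition in both argument orders. `MissingCounter` and `fold_partition` are hypothetical stand-ins written for this illustration, not PySpark code, and the sketch ignores the step that merges results across partitions.

```python
class MissingCounter(object):
    """Hypothetical stand-in for NAStatCounter: counts values and blanks."""

    def __init__(self):
        self.count = 0
        self.missing = 0

    def add(self, x):
        if x == '':
            self.missing += 1
        else:
            self.count += 1
        return self


def fold_partition(items, zero, op, acc_first):
    """Fold one partition, passing the accumulator first or second (assumed helper)."""
    acc = zero
    for obj in items:
        acc = op(acc, obj) if acc_first else op(obj, acc)
    return acc


def fresh_zero():
    return [MissingCounter() for _ in range(3)]


# Same shape as the op in the comment: x is assumed to be the list of counters.
op = lambda x, y: [counter.add(value) for counter, value in zip(x, y)]

rows = [line.split(',') for line in ['1,,2', ',3,1', '1,2,']]

# Accumulator first: each counter absorbs one column of the rows.
result = fold_partition(rows, fresh_zero(), op, acc_first=True)
summary = [(c.count, c.missing) for c in result]

# Element first: x is now a row of strings, so .add is looked up on a str.
try:
    fold_partition(rows, fresh_zero(), op, acc_first=False)
    error = None
except AttributeError as exc:
    error = str(exc)
```

With the accumulator passed first, every column ends up with two values and one missing entry; with the element passed first, the same `op` fails on the very first row.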
    
    As you mentioned, only the first argument may be modified. I would expect that argument to be the provided "zeroValue", with each element passed as the second argument, which must not be changed.
    
    Intuitively, users write the lambda function with "x" as the "zeroValue" and "y" as the element.
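
For comparison, Python's own `functools.reduce` follows the same convention: the running value is passed first and the next element second. A small sketch, where `op` and `calls` are names chosen only for this illustration:

```python
from functools import reduce

calls = []


def op(acc, element):
    # Record each call so the argument order is visible.
    calls.append((acc, element))
    return acc + element


# The initial value 0 plays the role of "zeroValue".
total = reduce(op, [1, 2, 3], 0)
```

Here `calls` records the accumulator in the first position on every step, which is the order most users would assume for "fold" as well.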

