[ 
https://issues.apache.org/jira/browse/MAHOUT-1227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665680#comment-13665680
 ] 

Jake Mannix commented on MAHOUT-1227:
-------------------------------------

in fact, as I dig through all the cases, I think I've found two cases so far 
where we do "for (Element e : v)" when we really don't want to be iterating 
over all, but really only want v.nonZeroes().

For instance:  MinHashMapper:

{code}
  public void map(Text item, VectorWritable features, Context context) throws 
IOException, InterruptedException {
    Vector featureVector = features.get();
    if (featureVector.size() < minVectorSize) {
      return;
    }
    // Initialize the minhash values to highest
    for (int i = 0; i < numHashFunctions; i++) {
      minHashValues[i] = Integer.MAX_VALUE;
    }

    for (int i = 0; i < numHashFunctions; i++) {
      for (Vector.Element ele : featureVector) {
        int value = (int) ele.get();
        bytesToHash[0] = (byte) (value >> 24);
        bytesToHash[1] = (byte) (value >> 16);
        bytesToHash[2] = (byte) (value >> 8);
        bytesToHash[3] = (byte) value;
        int hashIndex = hashFunction[i].hash(bytesToHash);
        //if our new hash value is less than the old one, replace the old one
        if (minHashValues[i] > hashIndex) {
          minHashValues[i] = hashIndex;
        }
      }
    }
{code}

It works this way, sure, but if you're minhashing a sparse vector of 
cardinality 10^9, you're going to be very sad about the performance here.
                
> Vector.iterateNonZero() is super-clumsy to use: add Iterable<Element> 
> allNonZero()
> ----------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1227
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1227
>             Project: Mahout
>          Issue Type: Improvement
>         Environment: all
>            Reporter: Andy Schlaikjer
>            Assignee: Jake Mannix
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1227.diff
>
>
> Currently, our codebase is littered with the following:
> {code}
> Iterator<Element> it = vector.iterateNonZero();
> while (it.hasNext()) {
>   Element e = it.next();
>   ...
> {code}
> wouldn't it be nice to be able to do:
> {code}
> for (Element e : vector.allNonZero()) {
>   ...
> {code}
> instead?
> I propose adding an Iterable<Element> allNonZero() which allow this syntactic 
> sugar.  To make it symmetric with iterateAll, let's also add 
> Iterable<Element> all(), and implement the simply in AbstractVector.
> The first diff adding this is very non-invasive - new methods added to 
> interface, implemented in the three classes from which all Vector 
> implementations derive (AbstractVector, NamedVector, and DelegatingVector).  
> User code should just work, unless they've implemented their own vector 
> without subclassing one of these three (yikes).
> Next diff, which is more invasive, would remove "extends Iterable<Element>" 
> from Vector, because using the foreach of a Vector itself is very rarely what 
> the caller really means to do (it's the all-iterator, very bad for the more 
> common sparse use case).  To achieve the same effect, the caller chooses 
> between vector.all() and vector.allNonZero(), and then they're being crystal 
> clear what they mean.
> Lastly, I'd propose we make iterateAll() and iterateAllNonZero() protected 
> methods on AbstractVector, so that we are forced to remove all the clumsy 
> places where we do Iterator<Element> it = ... all throughout the codebase.  I 
> suspect there will be very few places left that really want the raw iterator, 
> but if there are any, it can be gotten by calling 
> vector.(all/allNonZero).iterator()
> (feature-request/api fix suggestion idea courtesy of Andy Schlaikjer, 
> formalized as a proposal and posted up here by me, Jake Mannix)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to