[ 
https://issues.apache.org/jira/browse/PIG-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated PIG-3409:
------------------------

    Description: 
I've run into a serious performance issue.
Please see the VisualVM screenshot.

Here is the hashCode implementation from org.apache.pig.data.DefaultTuple:

{code}
    @Override
    public int hashCode() {
        int hash = 17;
        for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
            Object o = it.next();
            if (o != null) {
                hash = 31 * hash + o.hashCode();
            }
        }
        return hash;
    }
{code}

I don't see any reason to iterate over the whole tuple and aggregate the hash
value again on every call.

I can fix it, if it's possible for me to take part in the dev process. I'm new to it :(

The idea for any join:
If we have a plan, we know for sure which relations will be joined.
That means we can precalculate the hashcode values.
The difference is m+n hashcode calculations instead of m*n (current implementation).
I think it should bring a significant performance boost.
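
A minimal sketch of the precalculation idea, assuming the hash can be cached per tuple and invalidated whenever a field is replaced. The class CachedHashTuple, its set/get methods and the computations counter are hypothetical names used only for illustration; this is not the actual DefaultTuple code, and it assumes the field values themselves are not mutated in place after being stored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not org.apache.pig.data.DefaultTuple.
final class CachedHashTuple {
    private final List<Object> fields;
    private int cachedHash;     // meaningful only while hashValid is true
    private boolean hashValid;  // false until the first hashCode() call
    int computations;           // counts full passes over the fields (demo only)

    CachedHashTuple(List<Object> initial) {
        this.fields = new ArrayList<>(initial);
    }

    Object get(int index) {
        return fields.get(index);
    }

    void set(int index, Object value) {
        fields.set(index, value);
        hashValid = false;      // any write invalidates the cached hash
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            // Same formula as the snippet above: seed 17, multiplier 31, nulls skipped.
            int hash = 17;
            for (Object o : fields) {
                if (o != null) {
                    hash = 31 * hash + o.hashCode();
                }
            }
            cachedHash = hash;
            hashValid = true;
            computations++;
        }
        return cachedHash;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof CachedHashTuple)) {
            return false;
        }
        return fields.equals(((CachedHashTuple) other).fields);
    }
}
```

With a cache like this, repeated hashCode() calls on an unchanged tuple cost one field traversal in total, which is the m+n behaviour described above. Whether it helps in practice depends on how often tuples are modified between lookups.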

  was:
I've run into a serious performance issue.
Please see the VisualVM screenshot.

Here is the hashCode implementation from org.apache.pig.data.DefaultTuple:

{code}
    @Override
    public int hashCode() {
        int hash = 17;
        for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
            Object o = it.next();
            if (o != null) {
                hash = 31 * hash + o.hashCode();
            }
        }
        return hash;
    }
{code}

I don't see any reason to iterate over the whole tuple and aggregate the hash
value again on every call.

I can fix it, if it's possible for me to take part in the dev process. I'm new to it :(

    
> org.apache.pig.data.DefaultTuple hashcode performance issue
> -----------------------------------------------------------
>
>                 Key: PIG-3409
>                 URL: https://issues.apache.org/jira/browse/PIG-3409
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: 0.11
>            Reporter: Sergey
>            Priority: Critical
>   Original Estimate: 3h
>  Remaining Estimate: 3h
