[jira] [Updated] (PIG-4874) Remove schema tuple reference overhead for replicate join hashmap

Rohini Palaniswamy (JIRA) Sun, 17 Apr 2016 21:02:15 -0700

[ https://issues.apache.org/jira/browse/PIG-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Rohini Palaniswamy updated PIG-4874:
------------------------------------
    Attachment: PIG-4874-1.patch

Storing the replicated table currently takes up a lot of memory.

{code}
// Java: write 100 tab-separated rows "i<TAB>i" for i = 0..99 to /tmp/data
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream("/tmp/data"));
for (int i = 0; i < 100; i++) {
    bout.write((i + "\t" + i + "\n").getBytes());
}
bout.close();

-- Pig Latin: replicated join on the composite key (x, y); B is the relation held in memory
A = LOAD '/tmp/data' as (x:int, y:int);
B = LOAD '/tmp/data' as (x:int, y:int);
C = JOIN A by (x,y), B by (x,y) using 'replicated';
STORE C into '/tmp/tezout';
{code}

The 100 entries (0 to 99) in the above test take up 580 bytes on disk. The 
retained sizes reported by YourKit for the replicated-table map are:

pig.schematuple=false
    Before patch : 46096
    After patch  : 37232
pig.schematuple=true
    17808
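
To put the retained sizes above in relative terms, here is a small Java sketch 
that computes the percentage saved against the pre-patch figure. The class and 
method names are made up for illustration; only the three numbers come from 
the measurements above.

```java
class RetainedSizeSavings {
    // Percentage saved going from `before` to `after` (both in the same unit).
    static double savingsPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        long beforePatch = 46096; // pig.schematuple=false, before patch
        long afterPatch  = 37232; // pig.schematuple=false, after patch
        long schemaTuple = 17808; // pig.schematuple=true

        System.out.printf("patch alone:      %.1f%% smaller%n",
                savingsPercent(beforePatch, afterPatch));
        System.out.printf("with schematuple: %.1f%% smaller%n",
                savingsPercent(beforePatch, schemaTuple));
    }
}
```

With these inputs the patch alone works out to roughly 19% less retained 
memory, and pig.schematuple=true to roughly 61% less than the pre-patch map.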

This patch also optimizes SchemaTuple for the case of primitive keys. 
Currently TestSchemaTuple runs only with ExecType.LOCAL. I found that 
SchemaTuple does not work with Tez: it requires some resources to be shipped 
through the DistributedCache, and that is not done. I will create a separate 
JIRA to fix SchemaTuple for Tez and make it more robust, so that we can try 
to make it the default for replicated joins given such large memory savings.



> Remove schema tuple reference overhead for replicate join hashmap
> -----------------------------------------------------------------
>
>                 Key: PIG-4874
>                 URL: https://issues.apache.org/jira/browse/PIG-4874
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>            Assignee: Rohini Palaniswamy
>             Fix For: 0.16.0
>
>         Attachments: PIG-4874-1.patch
>
>
> Currently, even if pig.schematuple is set to false (the default), using 
> TupleToMapKey and TuplesToSchemaTupleList instead of a plain 
> HashMap<Object, ArrayList<Tuple>> costs a lot of memory. The key is also 
> currently converted to a tuple, which is unnecessary. 
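
As a rough illustration of the plain-map layout described above, the sketch 
below builds a HashMap<Object, ArrayList<...>> keyed directly by the join 
value, with no wrapper around either the key or the row list. It is not the 
code from the patch: List<Object> is a hypothetical stand-in for Pig's Tuple, 
and the class and method names are assumptions made so the example compiles 
without Pig on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReplicatedMapSketch {
    // Hypothetical stand-in for a Pig Tuple: an ordered list of field values.
    static List<Object> row(Object... fields) {
        return Arrays.asList(fields);
    }

    // Groups rows by join key. A single-column key is stored as the raw
    // field value (for example an Integer) instead of being wrapped in a
    // one-field tuple; a multi-column key is stored as a list of its fields.
    static Map<Object, ArrayList<List<Object>>> build(List<List<Object>> rows, int... keyCols) {
        Map<Object, ArrayList<List<Object>>> map = new HashMap<>();
        for (List<Object> r : rows) {
            Object key;
            if (keyCols.length == 1) {
                key = r.get(keyCols[0]);
            } else {
                List<Object> composite = new ArrayList<>(keyCols.length);
                for (int c : keyCols) {
                    composite.add(r.get(c));
                }
                key = composite;
            }
            map.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        return map;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(row(i, i)); // same shape as the (x:int, y:int) test data
        }
        Map<Object, ArrayList<List<Object>>> byX = build(rows, 0);
        Map<Object, ArrayList<List<Object>>> byXY = build(rows, 0, 1);
        System.out.println(byX.get(42));
        System.out.println(byXY.get(Arrays.asList(42, 42)));
    }
}
```

The probe side of the join would then look up each incoming key with a single 
map.get call; whether the patch stores composite keys this way is not stated 
in the issue, so treat that branch as an assumption.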



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
