[
https://issues.apache.org/jira/browse/PIG-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rohini Palaniswamy updated PIG-4874:
------------------------------------
Attachment: PIG-4874-1.patch
We use a lot of memory to store the replicate table.
{code}
// Java: write 100 rows of "i<TAB>i" as the test input
BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream("/tmp/data"));
for (int i = 0; i < 100; i++) {
    bout.write((i + "\t" + i + "\n").getBytes());
}
bout.close();

-- Pig: replicated join of the data with itself on both columns
A = LOAD '/tmp/data' as (x:int, y:int);
B = LOAD '/tmp/data' as (x:int, y:int);
C = JOIN A by (x,y), B by (x,y) using 'replicated';
STORE C into '/tmp/tezout';
{code}
The 100 entries from 0 to 99 in the above test take up 508 bytes on disk. Retained
sizes (in bytes) reported by YourKit for the replicates map are:
pig.schematuple=false
    Before patch : 46096
    After patch  : 37232
pig.schematuple=true
    17808
This patch also optimizes schema tuple for the case of primitive keys.
Currently TestSchemaTuple runs only with ExecType.LOCAL. I found that SchemaTuple
does not work with Tez because it requires resources to be shipped through
DistributedCache, which is not done. I will create a separate jira to fix
SchemaTuple for Tez and make it more robust, so that we can try to make it the
default for replicate join, given such large memory savings.
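
To illustrate the plain HashMap<Object, ArrayList<Tuple>> layout mentioned in the
issue description, here is a minimal, self-contained sketch. It is not the patch
code: rows are modelled as List<Object> instead of Pig's Tuple, and the names
ReplicatedJoinSketch, keyOf, build and probe are made up for this example. The
point it tries to show is that a single-column key can be used as the map key
directly, and only multi-column keys need a wrapper object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Pig's actual classes or the code in the patch.
class ReplicatedJoinSketch {

    // A single join column is returned as-is (no wrapper object);
    // multiple join columns are wrapped in a List, which has value-based
    // equals() and hashCode().
    static Object keyOf(List<Object> row, int[] keyCols) {
        if (keyCols.length == 1) {
            return row.get(keyCols[0]);
        }
        List<Object> key = new ArrayList<>(keyCols.length);
        for (int c : keyCols) {
            key.add(row.get(c));
        }
        return key;
    }

    // Build phase: load the small (replicated) relation into a plain HashMap.
    static Map<Object, ArrayList<List<Object>>> build(List<List<Object>> small,
                                                      int[] keyCols) {
        Map<Object, ArrayList<List<Object>>> replicates = new HashMap<>();
        for (List<Object> row : small) {
            replicates.computeIfAbsent(keyOf(row, keyCols), k -> new ArrayList<>())
                      .add(row);
        }
        return replicates;
    }

    // Probe phase: stream the large relation and emit one joined row per match.
    static List<List<Object>> probe(List<List<Object>> large, int[] keyCols,
                                    Map<Object, ArrayList<List<Object>>> replicates) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> row : large) {
            ArrayList<List<Object>> matches = replicates.get(keyOf(row, keyCols));
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (List<Object> match : matches) {
                List<Object> joined = new ArrayList<>(row);
                joined.addAll(match);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same shape as the test data: 100 rows of (i, i).
        List<List<Object>> data = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            data.add(Arrays.<Object>asList(i, i));
        }
        int[] keyCols = {0, 1}; // join on (x, y)
        List<List<Object>> joined = probe(data, keyCols, build(data, keyCols));
        System.out.println(joined.size());
    }
}
```

With keyCols = {0} the map keys are the boxed Integer values themselves, which is
the kind of per-entry wrapper saving the issue is about. The actual savings in Pig
depend on its Tuple implementations and are not reproduced by this sketch.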
> Remove schema tuple reference overhead for replicate join hashmap
> -----------------------------------------------------------------
>
> Key: PIG-4874
> URL: https://issues.apache.org/jira/browse/PIG-4874
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.16.0
>
> Attachments: PIG-4874-1.patch
>
>
> Currently, even if pig.schematuple is set to false, which is the default, the
> usage of TupleToMapKey and TuplesToSchemaTupleList instead of a plain
> HashMap<Object, ArrayList<Tuple>> costs a lot of memory. Also, the key is
> currently converted to a tuple, which is unnecessary.