[ 
https://issues.apache.org/jira/browse/PIG-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1875:
----------------------------

    Attachment: mrtuple.patch

Here's a first pass at what MToRTuple might look like.  I've done some basic 
testing to confirm that it works, but nothing comprehensive.

In test runs where I serialized 100k tuples, wrote them to disk, and read them 
back, I got the following results:

DefaultTuple:
time to write to disk:       81.93 sec
size on disk:                98M
time to read from disk:      12.62 sec
size in memory (after read): 238M

MToRTuple:
time to write to disk:       10.49 sec
size on disk:                58M
time to read from disk:      1.10 sec
size in memory (after read): 57M

So MToRTuple uses roughly 1/4 the memory and about 40% less disk space, and it 
is about 8x faster on disk writes and about 11x faster on disk reads.



> Keep tuples serialized to limit spilling and speed it when it happens
> ---------------------------------------------------------------------
>
>                 Key: PIG-1875
>                 URL: https://issues.apache.org/jira/browse/PIG-1875
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Alan Gates
>            Priority: Minor
>         Attachments: mrtuple.patch
>
>
> Currently Pig reads records off of the reduce iterator and immediately 
> deserializes them into Java objects.  This takes up much more memory than 
> the serialized form, so Pig spills sooner than it would if it stored them 
> serialized.  Also, if it does have to spill, it has to serialize them 
> again, and then again deserialize them after reading from the spill file.
> We should explore storing them in memory serialized when they are read off of 
> the reduce iterator.
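
To illustrate the idea in the description above, here is a minimal sketch of a 
tuple that stays serialized until a field is read. It is not the code in 
mrtuple.patch: the class name LazySerializedTuple, the string-only fields, and 
the length-prefixed byte layout are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not a Pig class and not the attached patch. */
public class LazySerializedTuple {
    private final byte[] bytes;   // serialized form, kept exactly as received
    private List<String> fields;  // decoded only on the first field access

    public LazySerializedTuple(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Assumed layout: an int field count, then each field via writeUTF. */
    public static byte[] serialize(List<String> fields) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(fields.size());
            for (String f : fields) {
                out.writeUTF(f);
            }
        }
        return buf.toByteArray();
    }

    public boolean isDeserialized() {
        return fields != null;
    }

    /** Decodes the whole tuple on first access, then serves fields from the list. */
    public String get(int i) throws IOException {
        if (fields == null) {
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes))) {
                int n = in.readInt();
                List<String> decoded = new ArrayList<>(n);
                for (int k = 0; k < n; k++) {
                    decoded.add(in.readUTF());
                }
                fields = decoded;
            }
        }
        return fields.get(i);
    }

    /** Spilling copies the stored bytes, so nothing is serialized a second time. */
    public void spill(DataOutputStream out) throws IOException {
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    /** Reading a spilled tuple back copies bytes only; decoding is still deferred. */
    public static LazySerializedTuple readSpilled(DataInputStream in)
            throws IOException {
        byte[] b = new byte[in.readInt()];
        in.readFully(b);
        return new LazySerializedTuple(b);
    }
}
```

In this sketch a tuple that is spilled and read back without any field being 
accessed is never decoded at all, which is the saving the description is after.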

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
