[ https://issues.apache.org/jira/browse/PIG-2638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Coveney updated PIG-2638:
----------------------------------

    Attachment: PIG-2638-1.patch

This is a newer, better version of the patch. I basically just used the same method we use for Ints (if it fits in a byte, write a byte; if it fits in a short, write a short; and so on) instead of the previous method. This does mean that for anything larger than an int the whole long will be written, but meh. It's an improvement, and there is no performance degradation at all, so it's an easy win IMHO.

Caliper output:
{code}
New
size    us linear runtime
   5  30.0 =====
  10  55.5 ==========
  15  82.0 ===============
  20 105.4 ====================
  25 135.7 ==========================
  30 156.1 ==============================

Old
   5  30.7 =====
  10  55.5 ==========
  15  79.5 ===============
  20 105.2 ====================
  25 130.4 =========================
  30 156.0 ==============================
{code}

The benchmark simply serialized a tuple of size x and then immediately deserialized it. As you can see, there is no meaningful difference in runtime, and the version with the patch will take up less space on disk.

> Optimize BinInterSedes treatment of longs
> -----------------------------------------
>
>                 Key: PIG-2638
>                 URL: https://issues.apache.org/jira/browse/PIG-2638
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jonathan Coveney
>            Assignee: Jonathan Coveney
>             Fix For: 0.11, 0.10.1
>
>         Attachments: PIG-2638-0.patch, PIG-2638-1.patch
>
>
> During adventures in BinInterSedes, I noticed that Integers are written in an
> optimized fashion, but longs are not. Given that, in the general case, we
> have to write type information anyway, we might as well do the same
> optimization for Longs. That is to say, given that most longs won't have 8
> bytes of information in them, why should we waste the space of serializing 8
> bytes?
> This patch takes its inspiration from varint encoding per these two sources:
> http://javasourcecode.org/html/open-source/mahout/mahout-0.5/org/apache/mahout/math/Varint.java.html
> https://developers.google.com/protocol-buffers/docs/encoding
> Though, nicely enough, we don't actually have to use varints. Since we HAVE
> to write an 8 byte type header, we might as well include the number of bytes
> we had to write. I use zig zag encoding so that in the case of negative
> numbers, we see the benefit.
> This should decrease the amount of serialized long data by a good bit.
> Patch incoming. It passes test-commit in 0.11.
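
For readers who want to see the shape of the size-tiered scheme described in the comment at the top (write a byte if the value fits in a byte, a short if it fits in a short, an int if it fits in an int, otherwise the whole long), here is a minimal sketch. It is not the code from PIG-2638-1.patch: the class name, the marker constants and their values, and the helper methods are all hypothetical, and it assumes one type-marker byte is written before each value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TieredLongCodec {
    // Hypothetical type markers; the real BinInterSedes constants and values may differ.
    static final byte LONG_IN_BYTE = 1;
    static final byte LONG_IN_SHORT = 2;
    static final byte LONG_IN_INT = 3;
    static final byte LONG_FULL = 4;

    // Writes one marker byte, then the value in the narrowest width that holds it.
    static void writeLong(DataOutput out, long v) throws IOException {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) {
            out.writeByte(LONG_IN_BYTE);
            out.writeByte((int) v);
        } else if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) {
            out.writeByte(LONG_IN_SHORT);
            out.writeShort((int) v);
        } else if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) {
            out.writeByte(LONG_IN_INT);
            out.writeInt((int) v);
        } else {
            // Anything wider than an int falls back to the full 8 bytes.
            out.writeByte(LONG_FULL);
            out.writeLong(v);
        }
    }

    // Reads the marker byte and widens the stored value back to a long.
    // readByte/readShort/readInt return signed values, so the sign is preserved.
    static long readLong(DataInput in) throws IOException {
        byte type = in.readByte();
        switch (type) {
            case LONG_IN_BYTE:  return in.readByte();
            case LONG_IN_SHORT: return in.readShort();
            case LONG_IN_INT:   return in.readInt();
            case LONG_FULL:     return in.readLong();
            default: throw new IOException("Unknown long type marker: " + type);
        }
    }

    // Helper: number of bytes writeLong emits for v (marker included).
    static int encodedSize(long v) {
        return encode(v).length;
    }

    // Helper: serialize then immediately deserialize, like the benchmark did per tuple.
    static long roundTrip(long v) {
        try {
            return readLong(new DataInputStream(new ByteArrayInputStream(encode(v))));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static byte[] encode(long v) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeLong(new DataOutputStream(buf), v);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] samples = {5L, 300L, 100000L, 1L << 40, -1L};
        for (long v : samples) {
            System.out.println(v + " -> " + encodedSize(v) + " bytes, round trip " + roundTrip(v));
        }
    }
}
```

Under these assumptions a long that fits in a byte costs 2 bytes (marker plus value), one that fits in a short costs 3, one that fits in an int costs 5, and anything larger costs 9, which is the "whole long will be written" trade-off mentioned in the comment. Because the narrow widths are signed, small negative values also stay small without any zig zag step.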