[
https://issues.apache.org/jira/browse/HADOOP-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734445#action_12734445
]
Scott Carey commented on HADOOP-6166:
-------------------------------------
One could take Intel's idea, and go 8 (or 6, or 12?) bytes at a time instead of
4. This would line up a bit better with the 12-bit lookup tables. 6 bytes might
be the easier boundary, requiring four 12-bit lookup tables, which is 64K (each
12-bit table is 4096 ints, or 16K). For comparison, the tables in the inner
loop of your "4_3" are 17K, and the current version's are 4K.
Going 12 bytes at a time would require 8 tables and 128K of space, and then
we're randomly jumping around lookup tables that don't fit in an L1 D-cache on
some processors.
The 8-bytes-at-a-time approach is available as C code (BSD license) here:
http://sourceforge.net/projects/slicing-by-8/files/
The trick with going past 4 bytes per iteration has to do with how the CRC
cycle works: after the first four bytes, the lookups change slightly, because
any bytes beyond the width of the 32-bit crc are used raw rather than XORed
with it. But I didn't read far enough into that code to be sure of exactly
what it is doing.
They do it with eight 1K lookup tables, each indexed by a single byte. They
also get to XOR a whole int out of the byte array directly, rather than
grabbing one byte at a time and shifting. We can do that in Java with a
ByteBuffer and getInt(), but when I tried that in the previous case, the
ByteBuffer creation overhead was too large and the ByteBuffer access seemed
very inefficient for some reason. Oh, how nice it would be if you could grab
the next 4 bytes of a byte[] as an int (or the next 8 as a long) without
wrapping it in a ByteBuffer, and let the compiler figure out the optimal
processor load instruction.
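In the meantime, the load can at least be factored into a small helper (just a
sketch; getIntLE is a made-up name, and whether the JIT collapses the shifts
into a single 4-byte load is exactly the open question):
{code}
// Sketch: read the next four bytes of b as a little-endian int without a
// ByteBuffer. Little-endian order is what the reflected CRC update wants.
private static int getIntLE(byte[] b, int off) {
  return (b[off] & 0xff)
      | (b[off + 1] & 0xff) << 8
      | (b[off + 2] & 0xff) << 16
      | (b[off + 3] & 0xff) << 24;
}
{code}
With that, the first four crc updates in the loop below would collapse to
{{crc ^= getIntLE(b, off); off += 4;}}.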
I think a 6-bytes-at-a-time version, with 4 lookups into 12-bit LUTs, would
look like this:
{code}
public class Crc32_6_4 extends Crc32Base {
  /** {@inheritDoc} */
  public void update(byte[] b, int off, int len) {
    while (len > 5) {  // consume 6 bytes per iteration
      // XOR the first four bytes into the 32-bit crc
      crc ^= b[off++] & 0xff;
      crc ^= (b[off++] & 0xff) << 8;
      crc ^= (b[off++] & 0xff) << 16;
      crc ^= (b[off++] & 0xff) << 24;
      // bytes 5 and 6 lie past the crc width, so they are not XORed with it
      int c0 = b[off++] & 0xff;
      c0 ^= (b[off++] & 0xff) << 8;
      // four 12-bit lookups cover all 48 bits
      crc = T12_3[crc & 0xfff]
          ^ T12_2[(crc >>> 12) & 0xfff]
          ^ T12_1[((crc >>> 24) | (c0 << 8)) & 0xfff]
          ^ T12_0[c0 >>> 4];
      len -= 6;
    }
    for (; len > 0; len--) {  // 0-5 trailing bytes, one at a time
      crc = (crc >>> 8) ^ Table8_0[(crc ^ b[off++]) & 0xff];
    }
  }
}
{code}
That assumes the tables were built right, and that the "bytes past the first
four don't get XORed with crc" handling is correct. I haven't tried this at
all.
All of these larger-lookup-table variants run the risk of performing worse
under concurrency, even if they are faster single-threaded, because cache
pressure is greater under concurrency. We might want to use the benchmark in
HADOOP-5318, which is heavily CRC-reliant, as a check that we aren't
regressing at higher concurrency due to cache pressure.
For reference, here is Intel's C code (from the link above) with its 8 tables;
the inner loop is the interesting part:
{code}
/*++
 *
 * Copyright (c) 2004-2006 Intel Corporation - All Rights Reserved
 *
 * This software program is licensed subject to the BSD License,
 * available at http://www.opensource.org/licenses/bsd-license.html
 *
 * Abstract: The main routine
 *
 --*/
uint32_t
crc32c_sb8_64_bit(
    uint32_t* p_running_crc,
    const uint8_t* p_buf,
    const uint32_t length,
    const uint32_t init_bytes,
    uint8_t mode)
{
    uint32_t li;
    uint32_t crc, term1, term2;
    uint32_t running_length;
    uint32_t end_bytes;

    if (mode == MODE_CONT)
        crc = *p_running_crc;
    else
        crc = CRC32C_INIT_REFLECTED;

    running_length = ((length - init_bytes) / 8) * 8;
    end_bytes = length - init_bytes - running_length;

    /* leading bytes, one at a time, until the buffer is aligned */
    for (li = 0; li < init_bytes; li++)
        crc = crc_tableil8_o32[(crc ^ *p_buf++) & 0x000000FF] ^ (crc >> 8);

    /* inner loop: 8 bytes per iteration */
    for (li = 0; li < running_length / 8; li++)
    {
        crc ^= *(uint32_t *)p_buf;
        p_buf += 4;
        term1 = crc_tableil8_o88[crc & 0x000000FF] ^
                crc_tableil8_o80[(crc >> 8) & 0x000000FF];
        term2 = crc >> 16;
        crc = term1 ^
              crc_tableil8_o72[term2 & 0x000000FF] ^
              crc_tableil8_o64[(term2 >> 8) & 0x000000FF];
        term1 = crc_tableil8_o56[(*(uint32_t *)p_buf) & 0x000000FF] ^
                crc_tableil8_o48[((*(uint32_t *)p_buf) >> 8) & 0x000000FF];
        term2 = (*(uint32_t *)p_buf) >> 16;
        crc = crc ^
              term1 ^
              crc_tableil8_o40[term2 & 0x000000FF] ^
              crc_tableil8_o32[(term2 >> 8) & 0x000000FF];
        p_buf += 4;
    }

    /* 0-7 trailing bytes, one at a time */
    for (li = 0; li < end_bytes; li++)
        crc = crc_tableil8_o32[(crc ^ *p_buf++) & 0x000000FF] ^ (crc >> 8);

    if ((mode == MODE_BEGIN) || (mode == MODE_CONT))
        return crc;
    return crc ^ XOROT;
}
{code}
That is pretty straightforward to implement, if the next four tables (T5, T6,
T7, T8 in the terminology of the current PureJavaCrc32) are generated the same
way as the previous four.
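For concreteness, an untested sketch of what that inner loop might look like
in Java (the table naming is an assumption: Tk[x] is taken to be the CRC of
byte x followed by k-1 zero bytes, so T1 is the ordinary byte-at-a-time
table and T5..T8 are built the same way as T1..T4):
{code}
// Untested sketch of the slicing-by-8 inner loop in Java. Assumes tables
// T1..T8 where Tk[x] is the CRC of byte x followed by (k - 1) zero bytes.
while (len > 7) {
  // XOR the first 4 bytes into crc, like Intel's crc ^= *(uint32_t *)p_buf
  final int c = crc
      ^  (b[off]     & 0xff)
      ^ ((b[off + 1] & 0xff) << 8)
      ^ ((b[off + 2] & 0xff) << 16)
      ^ ((b[off + 3] & 0xff) << 24);
  // eight byte-wide lookups cover all 64 bits; bytes 5-8 are used raw
  crc = T8[c & 0xff]           ^ T7[(c >>> 8) & 0xff]
      ^ T6[(c >>> 16) & 0xff]  ^ T5[c >>> 24]
      ^ T4[b[off + 4] & 0xff]  ^ T3[b[off + 5] & 0xff]
      ^ T2[b[off + 6] & 0xff]  ^ T1[b[off + 7] & 0xff];
  off += 8;
  len -= 8;
}
{code}
The 0-7 trailing bytes would then fall through to the usual byte-at-a-time
loop, as in the 6-byte sketch above.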
Intel makes sure the pointer used in the inner loop is on an 8-byte boundary
(the byte-at-a-time loop over init_bytes takes care of that), because the C
compiler can make the load needed for the {{crc ^= *(uint32_t *)p_buf}} part
faster in that case. They also tend to favor shifting by 16 and then 8,
avoiding shifts by 24, for some reason.
I may try out the eight-bytes-at-a-time version with 8 lookup tables next week.
> Improve PureJavaCrc32
> ---------------------
>
> Key: HADOOP-6166
> URL: https://issues.apache.org/jira/browse/HADOOP-6166
> Project: Hadoop Common
> Issue Type: Improvement
> Components: util
> Reporter: Tsz Wo (Nicholas), SZE
> Assignee: Tsz Wo (Nicholas), SZE
> Attachments: c6166_20090722.patch, c6166_20090722_benchmark_32VM.txt,
> c6166_20090722_benchmark_64VM.txt
>
>
> Got some ideas to improve CRC32 calculation.