[
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355805#comment-14355805
]
Edward Nevill commented on HADOOP-11660:
----------------------------------------
Hi,
I have added a new patch which reworks the code and (hopefully) makes it more
readable. I have also attached the resultant bulk_crc32.c as the patch is
growing rather large and difficult to understand.
I have made the following changes
1) I have removed the conditionalisation on USE_PIPELINED. I really do not
understand the condition that reads
{code}
#if (!defined(__FreeBSD__) && !defined(WINDOWS))
#define USE_PIPELINED
#endif
{code}
USE_PIPELINED is a feature of the architecture (i.e. is there any benefit to
pipelining on your particular arch - at the moment x86 or aarch64), not of the
OS, so I really don't understand this ifdef.
If some arch does not support pipelining, that arch must still implement a
pipelined version, but this can be a basic shell like the one I have done for
the sb8 versions:
{code}
static void pipelined_crc32c_sb8(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
                                 const uint8_t *p_buf, size_t block_size,
                                 int num_blocks) {
  assert(num_blocks >= 1 && num_blocks <= 3 && "invalid num_blocks");
  *crc1 = crc32c_sb8(*crc1, p_buf, block_size);
  if (num_blocks >= 2)
    *crc2 = crc32c_sb8(*crc2, p_buf + block_size, block_size);
  if (num_blocks >= 3)
    *crc3 = crc32c_sb8(*crc3, p_buf + 2 * block_size, block_size);
}
{code}
So, in this case, if there is no support for pipelining, we just call the
non-pipelined version three times, once with each buffer in turn. Doing this
makes the code a whole lot easier to read IMHO.
2) The existing code initialised a variable {{cached_cpu_supports_crc32}} and
then did
{code}
if (likely(cached_cpu_supports_crc32)) {
crc_update_func = crc32c_hardware;
{code}
The problem with this is that it creates a reference to crc32c_hardware even
when there is no hardware support. This means it is necessary to create a dummy
crc32c_hardware function when we are only doing SW crc, i.e.
{code}
static uint32_t crc32c_hardware(uint32_t crc, const uint8_t *data,
                                size_t length) {
  // never called!
  assert(0 && "hardware crc called on an unsupported platform");
  return 0;
}
{code}
Instead, what I do is initialise static function pointers in the initialisation
function. So for example, for aarch64 I do
{code}
void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  unsigned long auxv = getauxval(AT_HWCAP);
  if (auxv & HWCAP_CRC32) {
    pipelined_crc32c_func = pipelined_crc32c;
    pipelined_crc32_zlib_func = pipelined_crc32_zlib;
  }
}
{code}
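For comparison, the amd64 branch can follow the same pattern. The following is
only a rough sketch, assuming GCC's <cpuid.h> and an SSE4.2 check; the function
names mirror the aarch64 example and are illustrative, not the actual patch.
{code}
#include <cpuid.h>

// Illustrative only: install the HW pipelined CRC32C routine if the CPU
// reports SSE4.2 (which provides the crc32 instruction). The zlib CRC has no
// x86 instruction, so its pointer keeps the sb8 default.
void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & bit_SSE4_2)) {
    pipelined_crc32c_func = pipelined_crc32c;
  }
}
{code}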
By default these pointers are statically initialised to the SW routines, i.e.
{code}
// Statically initialise the function pointers to the software versions.
// If HW is available these will be overridden by the initialisation functions.
static crc_pipelined_func_t pipelined_crc32c_func = pipelined_crc32c_sb8;
static crc_pipelined_func_t pipelined_crc32_zlib_func = pipelined_crc32_zlib_sb8;
{code}
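For context, crc_pipelined_func_t is assumed here to be a function pointer type
matching the pipelined signature shown above; a minimal sketch:
{code}
// Assumed pointer type matching the pipelined CRC signature shown above:
// crc1/crc2/crc3 are updated in place over up to three consecutive blocks.
typedef void (*crc_pipelined_func_t)(uint32_t *crc1, uint32_t *crc2,
                                     uint32_t *crc3, const uint8_t *p_buf,
                                     size_t block_size, int num_blocks);
{code}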
So if nobody initialises them they fall back to SW.
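The call sites can then dispatch through the pointers unconditionally. A
minimal sketch of a caller (names such as checksum_type, CRC32C_POLYNOMIAL,
CRC_INITIAL_VAL, data and bytes_per_checksum are assumptions for illustration):
{code}
// Hypothetical call site: pick the pointer for the requested checksum type and
// call it; it resolves to either the HW routine or the sb8 fallback, whichever
// the constructor installed.
crc_pipelined_func_t crc_pipelined_func =
    (checksum_type == CRC32C_POLYNOMIAL) ? pipelined_crc32c_func
                                         : pipelined_crc32_zlib_func;
uint32_t crc1 = CRC_INITIAL_VAL, crc2 = CRC_INITIAL_VAL, crc3 = CRC_INITIAL_VAL;
crc_pipelined_func(&crc1, &crc2, &crc3, data, bytes_per_checksum, 3);
{code}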
3) I have removed the implementation of CRC acceleration for 32-bit x86; 32-bit
x86 now falls back to SW. Does anyone really build Hadoop for 32-bit? If yes, I
will gladly put it back.
The result of these changes is to vastly reduce the amount of
conditionalisation in the code. The conditionalisation is now reduced to
{code}
#if defined(__amd64__) && defined(__GNUC__) && !defined(__FreeBSD__)
.....
#elif defined(__aarch64__) // Start ARM 64 architecture
.....
#endif
{code}
I hope the above changes will make the code a whole lot more readable,
All the best,
Ed.
> Add support for hardware crc on ARM aarch64 architecture
> --------------------------------------------------------
>
> Key: HADOOP-11660
> URL: https://issues.apache.org/jira/browse/HADOOP-11660
> Project: Hadoop Common
> Issue Type: Improvement
> Components: native
> Affects Versions: 3.0.0
> Environment: ARM aarch64 development platform
> Reporter: Edward Nevill
> Assignee: Edward Nevill
> Priority: Minor
> Labels: performance
> Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> This patch adds support for hardware crc for ARM's new 64 bit architecture
> The patch is completely conditionalized on __aarch64__
> I have only added support for the non pipelined version as I benchmarked the
> pipelined version on aarch64 and it showed no performance improvement.
> The aarch64 version supports both Castagnoli and Zlib CRCs as both of these
> are supported on ARM aarch64 hardware.
> To benchmark this I modified the test_bulk_crc32 test to print out the time
> taken to CRC a 1MB dataset 1000 times.
> Before:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> After:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> So this represents a 5X performance improvement on raw CRC calculation.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)