[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14355805#comment-14355805 ]

Edward Nevill commented on HADOOP-11660:
----------------------------------------

Hi,

I have added a new patch which reworks the code and (hopefully) makes it more 
readable. I have also attached the resultant bulk_crc32.c as the patch is 
growing rather large and difficult to understand.

I have made the following changes:

1) I have removed the conditionalisation on USE_PIPELINED. I really do not 
understand the condition that reads

{code}
#if (!defined(__FreeBSD__) && !defined(WINDOWS))
#define USE_PIPELINED
#endif
{code}

USE_PIPELINED is a feature of the architecture (i.e. is there any benefit to 
pipelining on your particular arch - at the moment x86 or aarch64), not of the OS, 
so I really don't understand this ifdef.

If some arch does not support pipelining, it must still provide a pipelined entry 
point, but this can be a basic shell like the ones I have added for the sb8 
versions.

{code}
static void pipelined_crc32c_sb8(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
                                 const uint8_t *p_buf, size_t block_size,
                                 int num_blocks) {
  assert(num_blocks >= 1 && num_blocks <= 3 && "invalid num_blocks");
  *crc1 = crc32c_sb8(*crc1, p_buf, block_size);
  if (num_blocks >= 2)
    *crc2 = crc32c_sb8(*crc2, p_buf + block_size, block_size);
  if (num_blocks >= 3)
    *crc3 = crc32c_sb8(*crc3, p_buf + 2 * block_size, block_size);
}
{code}

So, in this case, if there is no support for pipelining, we just call the 
non-pipelined version up to three times, once for each block in turn.

Doing this makes the code a whole lot easier to read, IMHO.

2) The existing code initialised a variable {{cached_cpu_supports_crc32}} and 
then did

{code}
      if (likely(cached_cpu_supports_crc32)) {
        crc_update_func = crc32c_hardware;
{code}

The problem with this is that it creates a reference to crc32c_hardware even when 
there is no hardware support. This means it is necessary to provide a dummy 
crc32c_hardware function when we are only doing the SW crc, i.e.

{code}
static uint32_t crc32c_hardware(uint32_t crc, const uint8_t* data, size_t length) {
  // never called!
  assert(0 && "hardware crc called on an unsupported platform");
  return 0;
}
{code}

Instead, what I do is initialise static function pointers from an initialisation 
function. So, for example, for aarch64 I do

{code}
void __attribute__ ((constructor)) init_cpu_support_flag(void) {
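  // getauxval(AT_HWCAP) is declared in <sys/auxv.h>; HWCAP_CRC32 comes from <asm/hwcap.h>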
  unsigned long auxv = getauxval(AT_HWCAP);
  if (auxv & HWCAP_CRC32) {
    pipelined_crc32c_func = pipelined_crc32c;
    pipelined_crc32_zlib_func = pipelined_crc32_zlib;
  }
}
{code}
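
Because the constructor runs when the native library is loaded, the pointers are 
settled before any checksum call is made. The same pattern applies on amd64; a 
sketch along the following lines (my illustration using GCC's 
__builtin_cpu_supports, not necessarily the exact CPUID handling in the attached 
bulk_crc32.c, with pipelined_crc32c naming the HW routine as in the aarch64 
snippet) would override only the Castagnoli pointer, since the x86 crc32 
instruction does not cover the Zlib polynomial:

{code}
#if defined(__amd64__) && defined(__GNUC__) && !defined(__FreeBSD__)
void __attribute__ ((constructor)) init_cpu_support_flag(void) {
  // SSE4.2 implies the crc32 instruction, which only computes CRC32C (Castagnoli)
  if (__builtin_cpu_supports("sse4.2")) {
    pipelined_crc32c_func = pipelined_crc32c;
  }
}
#endif
{code}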

By default these pointers are statically initialised to the SW routines, i.e.

{code}
// Statically initialise the function pointers to the software versions.
// If HW is available these will be overridden by the initialisation functions.
static crc_pipelined_func_t pipelined_crc32c_func = pipelined_crc32c_sb8;
static crc_pipelined_func_t pipelined_crc32_zlib_func = pipelined_crc32_zlib_sb8;
{code}

So if no constructor overrides them, we fall back to the SW versions.
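
For reference, the function pointer type behind these statics can be 
reconstructed from the pipelined_crc32c_sb8 signature shown above; something like 
the following (the name crc_pipelined_func_t is taken from the snippet, the exact 
typedef in the attached file may differ):

{code}
// Pipelined CRC over up to three equal-sized blocks starting at p_buf
typedef void (*crc_pipelined_func_t)(uint32_t *crc1, uint32_t *crc2, uint32_t *crc3,
                                     const uint8_t *p_buf, size_t block_size,
                                     int num_blocks);
{code}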

3) I have removed the implementation of CRC acceleration on 32-bit x86; 32-bit x86 
now falls back to SW.

Does anyone really build Hadoop for 32-bit? If yes, I will gladly put it back.

The result of these changes is to vastly reduce the amount of conditionalisation 
in the code. It is now reduced to

{code}
#if defined(__amd64__) && defined(__GNUC__) && !defined(__FreeBSD__)
.....
#elif defined(__aarch64__) // Start ARM 64 architecture
.....
#endif
{code}
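
The call sites themselves can then stay arch-independent: they simply pick one of 
the two pointers by checksum type and call it. Roughly (a sketch of the idea, not 
a verbatim extract; checksum_type, CRC32C_POLYNOMIAL and CRC_INITIAL_VAL stand in 
for whatever names the existing code uses):

{code}
// Sketch: dispatch once by polynomial, then CRC three checksum-sized blocks
// per iteration; the pointer is HW-backed only if a constructor overrode it.
crc_pipelined_func_t crc_func =
    (checksum_type == CRC32C_POLYNOMIAL) ? pipelined_crc32c_func
                                         : pipelined_crc32_zlib_func;
uint32_t crc1 = CRC_INITIAL_VAL, crc2 = CRC_INITIAL_VAL, crc3 = CRC_INITIAL_VAL;
crc_func(&crc1, &crc2, &crc3, data, bytes_per_checksum, 3);
{code}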

I hope the above changes will make the code a whole lot more readable,

All the best,
Ed.


> Add support for hardware crc on ARM aarch64 architecture
> --------------------------------------------------------
>
>                 Key: HADOOP-11660
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11660
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: native
>    Affects Versions: 3.0.0
>         Environment: ARM aarch64 development platform
>            Reporter: Edward Nevill
>            Assignee: Edward Nevill
>            Priority: Minor
>              Labels: performance
>         Attachments: bulk_crc32.c, jira-11660-new.patch, jira-11660.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This patch adds support for hardware crc for ARM's new 64 bit architecture
> The patch is completely conditionalized on __aarch64__
> I have only added support for the non pipelined version as I benchmarked the 
> pipelined version on aarch64 and it showed no performance improvement.
> The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
> are supported on ARM aarch64 hardware.
> To benchmark this I modified the test_bulk_crc32 test to print out the time 
> taken to CRC a 1MB dataset 1000 times.
> Before:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> After:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> So this represents a 5X performance improvement on raw CRC calculation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
