[ https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386859#comment-14386859 ]
Edward Nevill commented on HADOOP-11660:
----------------------------------------

Hi,

I have revised the patch to include the changes requested above.

I have also updated test_bulk_crc32.c so it prints out the times for 16384 bytes @ 512 bytes per checksum X 1000000 iterations for both the Castagnoli and Zlib polynomials.

The following are the results I get for x86_64 before and after. I have done 5 runs of each.

BEFORE
{code}
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.8
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.84
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.1
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.85
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.94
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.81
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
{code}

AFTER
{code}
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.11
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.99
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.11
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.9
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
[ed@mylittlepony hadoop]$ ./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 1.12
CRC 16384 bytes @ 512 bytes per checksum X 1000000 iterations = 13.92
./hadoop-common-project/hadoop-common/target/native/test_bulk_crc32: SUCCESS.
{code}

Looking at the average time over 5 runs gives

BEFORE
Castagnoli average = 1.116 sec
Zlib average = 13.848 sec

AFTER
Castagnoli average = 1.116 sec
Zlib average = 13.93 sec

So the performance for the Castagnoli polynomial is the same. For the Zlib polynomial there seems to be a performance degradation of 0.6%.
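
The averages and the 0.6% figure can be reproduced from the run times quoted above. The following is only a small arithmetic sketch; the helper names `mean` and `percent_change` and the array names are made up for illustration and are not part of test_bulk_crc32.c.

```c
#include <stddef.h>

/* Arithmetic mean of n timing samples, in seconds. */
static double mean(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Relative change from `before` to `after` in percent.
 * A positive result means `after` is slower. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Timings copied from the five BEFORE and AFTER x86_64 runs in this comment. */
static const double crc32c_before[5] = {1.12, 1.12, 1.10, 1.12, 1.12};
static const double crc32c_after[5]  = {1.11, 1.12, 1.11, 1.12, 1.12};
static const double zlib_before[5]   = {13.80, 13.84, 13.85, 13.94, 13.81};
static const double zlib_after[5]    = {13.92, 13.99, 13.90, 13.92, 13.92};
```

Calling `percent_change(mean(zlib_before, 5), mean(zlib_after, 5))` gives a value just under 0.6, which matches the rounded figure in the comment; the same call on the Castagnoli arrays gives essentially zero.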
The Zlib difference may be due to experimental error. In any case, the Zlib polynomial is unaccelerated on x86 because it is not supported by x86 HW, and it is not used for HDFS.

For comparison, on aarch64 partner HW I get the following averages

Castagnoli = 3.586
Zlib = 3.580

Many thanks for your help with this,
Ed.

> Add support for hardware crc on ARM aarch64 architecture
> --------------------------------------------------------
>
> Key: HADOOP-11660
> URL: https://issues.apache.org/jira/browse/HADOOP-11660
> Project: Hadoop Common
> Issue Type: Improvement
> Components: native
> Affects Versions: 3.0.0
> Environment: ARM aarch64 development platform
> Reporter: Edward Nevill
> Assignee: Edward Nevill
> Priority: Minor
> Labels: performance
> Attachments: jira-11660.patch
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> This patch adds support for hardware crc for ARM's new 64-bit architecture.
> The patch is completely conditionalized on __aarch64__.
> I have only added support for the non-pipelined version, as I benchmarked the
> pipelined version on aarch64 and it showed no performance improvement.
> The aarch64 version supports both Castagnoli and Zlib CRCs, as both of these
> are supported on ARM aarch64 hardware.
> To benchmark this I modified the test_bulk_crc32 test to print out the time
> taken to CRC a 1MB dataset 1000 times.
> Before:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> After:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> So this represents roughly a 4.5X performance improvement on raw CRC calculation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
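
To illustrate what "conditionalized on __aarch64__" with support for both polynomials can look like, here is a minimal sketch. It is not the code from jira-11660.patch. The names `sketch_crc32c`, `sketch_crc32_zlib`, `crc_sw`, and `HAVE_HW_CRC` are hypothetical. The sketch assumes a little-endian target and a compiler that defines the ACLE macro `__ARM_FEATURE_CRC32` (for example GCC or Clang with `-march=armv8-a+crc`), in which case it uses the `<arm_acle.h>` CRC intrinsics; everywhere else it falls back to a slow bit-at-a-time loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Use the hardware path only on aarch64 builds with the CRC extension. */
#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
#define HAVE_HW_CRC 1
#else
#define HAVE_HW_CRC 0
#endif

/* Bit-reflected forms of the two polynomials. */
#define POLY_ZLIB   0xEDB88320u  /* CRC-32 as used by zlib */
#define POLY_CRC32C 0x82F63B78u  /* CRC-32C (Castagnoli)   */

#if !HAVE_HW_CRC
/* Portable fallback: processes one bit per inner-loop step. */
static uint32_t crc_sw(uint32_t crc, const uint8_t *p, size_t n, uint32_t poly)
{
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
    return crc;
}
#endif

/* CRC-32C of a buffer, with the usual 0xFFFFFFFF initial value and final XOR. */
uint32_t sketch_crc32c(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
#if HAVE_HW_CRC
    while (n >= 8) {            /* 8 bytes per instruction (little-endian) */
        uint64_t w;
        memcpy(&w, p, 8);       /* avoids unaligned-access assumptions */
        crc = __crc32cd(crc, w);
        p += 8;
        n -= 8;
    }
    while (n--)                 /* remaining tail bytes */
        crc = __crc32cb(crc, *p++);
#else
    crc = crc_sw(crc, p, n, POLY_CRC32C);
#endif
    return crc ^ 0xFFFFFFFFu;
}

/* Zlib-polynomial CRC-32 of a buffer, same conventions as above. */
uint32_t sketch_crc32_zlib(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
#if HAVE_HW_CRC
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);
        crc = __crc32d(crc, w);
        p += 8;
        n -= 8;
    }
    while (n--)
        crc = __crc32b(crc, *p++);
#else
    crc = crc_sw(crc, p, n, POLY_ZLIB);
#endif
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions can be checked against the standard test string "123456789", whose CRC-32C is 0xE3069283 and whose zlib CRC-32 is 0xCBF43926. Because aarch64 has instructions for both polynomials, both paths take the same hardware route there, which is consistent with the near-identical Castagnoli and Zlib averages reported above.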