[ 
https://issues.apache.org/jira/browse/HADOOP-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584577#comment-14584577
 ] 

Andrew Pinski commented on HADOOP-11660:
----------------------------------------

> On aarch64 a crc32 takes 3 exec cycles, but it only has one crc unit, so if
> we did the same on aarch64 and had say
> crc32 w0, w0, x3
> crc32 w1, w1, x4
> crc32 w2, w2, x5
> The second crc32 w1, w1, x4 would have to wait for the 1st crc to complete
> and the 3rd would have to wait for the 2nd, taking 9 cycles in any case, so
> there is no benefit to pipelining.

This is not true on some AArch64 processors.  For ThunderX it is definitely 
not true: crc32 (32 bits) is fully pipelined, so the next one issues right 
away.
The latency of those instructions is 4 cycles, though.  The three independent 
crc32 instructions above issue in consecutive cycles and the last one 
completes 4 cycles after it issues, so the sequence takes only 2 + 4 = 6 
cycles on ThunderX.
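
As a rough sketch of that arithmetic (a simplified model, not a measurement; 
the function name and the issue_interval parameter are made up here for 
illustration), the time for n independent instructions on a single unit can 
be modelled as (n - 1) * issue_interval + latency:

```c
/*
 * Hypothetical model: cycles until the last of n independent
 * instructions completes on a single execution unit.
 *
 *   latency        - cycles from issue to result
 *   issue_interval - cycles between consecutive issues on the unit
 *                    (1 = fully pipelined; equal to latency = the unit
 *                    is busy until the previous result is ready)
 *
 * Real cores have other limits (decode width, dependencies, memory)
 * that this deliberately ignores.
 */
unsigned cycles_to_complete(unsigned n, unsigned latency,
                            unsigned issue_interval)
{
    if (n == 0)
        return 0;
    /* The last instruction issues after (n - 1) intervals, then
     * needs one full latency to produce its result. */
    return (n - 1) * issue_interval + latency;
}
```

With the figures from this thread, cycles_to_complete(3, 3, 3) gives 9 (the 
non-pipelined case quoted above) and cycles_to_complete(3, 4, 1) gives 6 
(ThunderX).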

Thanks,
Andrew Pinski

> Add support for hardware crc of HDFS checksums on ARM aarch64 architecture
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-11660
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11660
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: native
>    Affects Versions: 2.8.0
>         Environment: ARM aarch64 development platform
>            Reporter: Edward Nevill
>            Assignee: Edward Nevill
>            Priority: Minor
>              Labels: performance
>             Fix For: 2.8.0
>
>         Attachments: jira-11660.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> This patch adds support for hardware crc for ARM's new 64 bit architecture.
> The patch is completely conditionalized on __aarch64__.
> I have only added support for the non-pipelined version as I benchmarked the 
> pipelined version on aarch64 and it showed no performance improvement.
> The aarch64 version supports both Castagnoli and Zlib CRCs as both of these 
> are supported on ARM aarch64 hardware.
> To benchmark this I modified the test_bulk_crc32 test to print out the time 
> taken to CRC a 1MB dataset 1000 times.
> Before:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 2.55
> After:
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> CRC 1048576 bytes @ 512 bytes per checksum X 1000 iterations = 0.57
> So this represents a roughly 4.5X (2.55 / 0.57) performance improvement on
> raw CRC calculation.
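
For readers unfamiliar with the approach, here is a minimal sketch of a 
non-pipelined hardware CRC that is conditionalized on __aarch64__ and covers 
both the Castagnoli and zlib polynomials.  It is not the code from 
jira-11660.patch: the function names (sketch_crc32c, sketch_crc32_zlib, 
crc_sw) are hypothetical, it uses the ACLE intrinsics from <arm_acle.h> 
rather than whatever the patch uses, and it assumes a little-endian target 
built with the CRC extension enabled (for example -march=armv8-a+crc).  On 
any other target it falls back to a bitwise software loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hardware path only on little-endian AArch64 with the CRC extension. */
#if defined(__aarch64__) && defined(__ARM_FEATURE_CRC32) && \
    !defined(__ARM_BIG_ENDIAN)
#include <arm_acle.h>
#define SKETCH_HW_CRC 1
#else
#define SKETCH_HW_CRC 0
#endif

/* Portable bitwise fallback; poly is the bit-reflected polynomial. */
uint32_t crc_sw(uint32_t crc, const uint8_t *p, size_t n, uint32_t poly)
{
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
    return crc;
}

/* CRC-32C (Castagnoli), one 8-byte step at a time, no interleaving. */
uint32_t sketch_crc32c(const void *buf, size_t n)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
#if SKETCH_HW_CRC
    while (n >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);          /* avoids unaligned-access issues */
        crc = __crc32cd(crc, v);
        p += 8;
        n -= 8;
    }
    while (n--)
        crc = __crc32cb(crc, *p++);
#else
    crc = crc_sw(crc, p, n, 0x82F63B78u);
#endif
    return ~crc;
}

/* CRC-32 as used by zlib (polynomial 0x04C11DB7, reflected). */
uint32_t sketch_crc32_zlib(const void *buf, size_t n)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;
#if SKETCH_HW_CRC
    while (n >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);
        crc = __crc32d(crc, v);
        p += 8;
        n -= 8;
    }
    while (n--)
        crc = __crc32b(crc, *p++);
#else
    crc = crc_sw(crc, p, n, 0xEDB88320u);
#endif
    return ~crc;
}
```

Both functions can be checked against the standard test string "123456789", 
whose CRC-32C is 0xE3069283 and whose zlib CRC-32 is 0xCBF43926.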



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
