Hi,

A coworker just asked me about a performance problem in IronPython compared to
CPython.

The attached test script reproduces the problem. 

On CPython 2.7.6, it needs about 1.5 seconds on our test directory (once the OS
disk cache is hot), and CPython 3.3 needs about 1.7 seconds, while IronPython
2.7 needs more than 10 minutes (!).

C:\Users\m.schaber>c:\Python33\python.exe d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.721932103156255

C:\Users\m.schaber>c:\Python33\python.exe d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.7523154039322837

C:\Users\m.schaber>python d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.44541429616

C:\Users\m.schaber>python d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.40604227074

C:\Users\m.schaber>"c:\Program Files (x86)\IronPython 2.7\ipy.exe" d:\crc.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 602.745100044

C:\Users\m.schaber>"c:\Program Files (x86)\IronPython 2.7\ipy.exe" d:\crc.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 607.252915722


My first guess was that the problem lies in the mismatch between CPython's
8-bit strings and .NET strings, which forces expensive conversions. (I also
guess that a Python 3 based IronPython would fix this issue.)

One idea to fix this may be to add an overload to MD5Type.update() which 
directly accepts strings (and maybe one accepting byte arrays), to avoid the 
call to the conversion functions.

On a closer look, there's the additional (and IMHO much worse) problem that the
update() method does not seem to work incrementally:

private void update(IList<byte> newBytes) {
    byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count];
    Array.Copy(_bytes, updatedBytes, _bytes.Length);
    newBytes.CopyTo(updatedBytes, _bytes.Length);
    _bytes = updatedBytes;
    _hash = GetHasher().ComputeHash(_bytes);
}

In our use-case, this means that every file which is read leads to a
reallocation, a copy, and a recalculation of the MD5 sum of all the data read
so far, so the total work grows quadratically with the amount of data. This is
suboptimal from both a memory and a performance perspective.
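
To illustrate the cost, here is a small Python sketch. It is not IronPython's
actual code: the class name RehashingMD5 and the counter bytes_hashed are made
up for this example, and the 100 chunks of 1000 bytes are only a stand-in for
our files. It mimics the update() shown above and compares it with the
incremental hashlib.md5 object.

```python
import binascii
import hashlib


class RehashingMD5(object):
    """Hypothetical stand-in that mimics the update() shown above:
    keep every byte seen so far and rehash all of it on each call."""

    def __init__(self):
        self._bytes = b""
        self._hash = hashlib.md5(self._bytes).digest()
        self.bytes_hashed = 0  # total number of bytes pushed through MD5

    def update(self, new_bytes):
        # Reallocate and copy everything, then rehash from scratch.
        self._bytes = self._bytes + new_bytes
        self._hash = hashlib.md5(self._bytes).digest()
        self.bytes_hashed += len(self._bytes)

    def hexdigest(self):
        return binascii.hexlify(self._hash).decode("ascii")


# 100 chunks of 1000 bytes each stand in for 100 small files.
chunks = [b"x" * 1000] * 100

slow = RehashingMD5()
fast = hashlib.md5()  # incremental: each byte is processed once
for chunk in chunks:
    slow.update(chunk)
    fast.update(chunk)

# Both approaches yield the same checksum ...
assert slow.hexdigest() == fast.hexdigest()

# ... but the rehashing variant has hashed 1000 * (1 + 2 + ... + 100)
# = 5050000 bytes, compared to 100000 bytes of actual input.
print(slow.bytes_hashed)
```

With n chunks of equal size, the rehashing variant processes on the order of
n*n/2 chunks in total, which would fit the slowdown we see on a directory with
many files.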

I'm not an expert on the .NET crypto APIs, but I guess there should be some 
incremental API available there which could be exploited.

If not, we could try to find a suitable pure .NET implementation like 
http://archive.msdn.microsoft.com/SilverlightMD5.

A less intrusive workaround may be to collect the bytes in a MemoryStream and
feed them to ComputeHash() only on demand, when someone actually requests the
hash result via digest() or hexdigest().
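
The idea of this workaround can be sketched in Python as follows. This is only
an illustration under my own assumptions, not a patch: DeferredMD5 is a made-up
name, io.BytesIO plays the role of the MemoryStream, and hashlib.md5 stands in
for the call to ComputeHash().

```python
import hashlib
import io


class DeferredMD5(object):
    """Hypothetical sketch: buffer all input and hash only on demand."""

    def __init__(self):
        self._buffer = io.BytesIO()  # plays the role of the MemoryStream

    def update(self, new_bytes):
        # Appending is cheap; no hashing happens here.
        self._buffer.write(new_bytes)

    def digest(self):
        # The hash is computed only when it is actually requested.
        return hashlib.md5(self._buffer.getvalue()).digest()

    def hexdigest(self):
        return hashlib.md5(self._buffer.getvalue()).hexdigest()


deferred = DeferredMD5()
reference = hashlib.md5()
for part in (b"first file", b"second file", b"third file"):
    deferred.update(part)
    reference.update(part)

# The deferred variant produces the same checksum as incremental hashing.
assert deferred.hexdigest() == reference.hexdigest()
```

This would remove the repeated rehashing, but all data still has to be kept in
memory until the hash is requested, so it addresses the run time more than the
memory consumption.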


PS: Our use-case of MD5 is purely for technical data integrity, not for
protection against malicious users; cryptographic security is not required.

Best regards

Markus Schaber

3S-Smart Software Solutions GmbH
Dipl.-Inf. Markus Schaber | Product Development Core Technology
