Hi,
A coworker just asked me about a performance problem with IronPython compared
to CPython.
The attached test script reproduces the problem.
On CPython 2.7.6, it needs about 1.5 seconds on our test directory (once the OS
disk cache is hot), and CPython 3.3 needs about 1.7 seconds, while IronPython
2.7 needs more than 10 minutes(!).

C:\Users\m.schaber>c:\Python33\python.exe d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.721932103156255

C:\Users\m.schaber>c:\Python33\python.exe d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.7523154039322837

C:\Users\m.schaber>python d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.44541429616

C:\Users\m.schaber>python d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.40604227074

C:\Users\m.schaber>"c:\Program Files (x86)\IronPython 2.7\ipy.exe" d:\crc.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 602.745100044

C:\Users\m.schaber>"c:\Program Files (x86)\IronPython 2.7\ipy.exe" d:\crc.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 607.252915722

My first guess was that the problem lies in the mismatch between CPython's
8-bit strings and .NET strings, which causes expensive conversions. (I also
guess that a Python 3 based IronPython will fix this issue.)
One idea to fix this may be to add an overload to MD5Type.update() which
directly accepts strings (and maybe one accepting byte arrays), to avoid the
call to the conversion functions.
On a closer look, there's the additional (and IMHO much worse) problem that the
update() method seems not to work incrementally:
    private void update(IList<byte> newBytes) {
        byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count];
        Array.Copy(_bytes, updatedBytes, _bytes.Length);
        newBytes.CopyTo(updatedBytes, _bytes.Length);
        _bytes = updatedBytes;
        _hash = GetHasher().ComputeHash(_bytes);
    }
In our use case, this means that every file we read leads to a reallocation, a
copy, and a recalculation of the MD5 sum over all the data read so far. The
cost therefore grows quadratically with the number of update() calls, which is
bad for both memory usage and performance.
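
To make the cost concrete, here is a small Python sketch (not the IronPython source) that mimics the behaviour of the update() method quoted above next to a truly incremental hash. The class name RehashingMD5 and its bytes_hashed counter are made up for illustration; hashlib.md5 is the standard library API.

```python
import hashlib


class RehashingMD5(object):
    """Mimics the quoted update(): keep all data, re-hash it on every call."""

    def __init__(self):
        self._bytes = b""
        self._hash = hashlib.md5(b"").digest()
        self.bytes_hashed = 0  # illustration only: total bytes fed to MD5

    def update(self, new_bytes):
        # Build a new buffer holding everything seen so far, then hash it all.
        self._bytes = self._bytes + new_bytes
        self._hash = hashlib.md5(self._bytes).digest()
        self.bytes_hashed += len(self._bytes)

    def digest(self):
        return self._hash


# Stand-in for reading 100 files of 1000 bytes each.
chunks = [b"x" * 1000 for _ in range(100)]

slow = RehashingMD5()
fast = hashlib.md5()  # incremental: each input byte is processed once
for chunk in chunks:
    slow.update(chunk)
    fast.update(chunk)

same_digest = slow.digest() == fast.digest()
total_input = sum(len(c) for c in chunks)
print("same digest:", same_digest)
print("input bytes:", total_input)
print("bytes hashed by the re-hashing variant:", slow.bytes_hashed)
```

With 100 equally sized updates, the re-hashing variant processes 5050 chunks' worth of data instead of 100, while both variants produce the same digest.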
I'm not an expert on the .NET crypto APIs, but I guess there should be some
incremental API available there which could be exploited.
If not, we could try to find a suitable pure .NET implementation like
http://archive.msdn.microsoft.com/SilverlightMD5.
A less intrusive workaround may be to collect the bytes in a MemoryStream and
to feed it to ComputeHash() only on demand, when someone actually requests the
hash result via digest() or hexdigest().
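
As a rough illustration of that workaround, the following Python sketch buffers the data and hashes it only when a digest is requested. It is an analogy, not a patch: io.BytesIO stands in for the .NET MemoryStream, and the class name LazyMD5 and its _hasher() helper are hypothetical.

```python
import hashlib
import io


class LazyMD5(object):
    """Collects data in a buffer and hashes only when a digest is requested."""

    def __init__(self):
        self._buffer = io.BytesIO()  # plays the role of the MemoryStream
        self._cached = None          # hash object for the current buffer

    def update(self, data):
        self._buffer.write(data)     # append only, no hashing here
        self._cached = None          # any previously computed hash is stale

    def _hasher(self):
        # Hash the whole buffer once, then reuse the result until update().
        if self._cached is None:
            self._cached = hashlib.md5(self._buffer.getvalue())
        return self._cached

    def digest(self):
        return self._hasher().digest()

    def hexdigest(self):
        return self._hasher().hexdigest()


lazy = LazyMD5()
lazy.update(b"hello ")
lazy.update(b"world")
print(lazy.hexdigest() == hashlib.md5(b"hello world").hexdigest())
```

This avoids re-hashing on every update(), but it still keeps all of the input in memory, so it only addresses the run time and not the memory consumption.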
PS: We use MD5 purely to check technical data integrity, not to protect against
malicious users, so cryptographic security is not required.
Best regards
Markus Schaber