Hi, I'm just trying to sum it up:
1) The current code: - High memory usage. - High load on the large object heap. - Limited by the available amount of memory (which might be considered a violation of the Python API). - High CPU usage when used incrementally (quadratic to the number of blocks added). 2) Optimizing with MemoryStream and lazy calculation: - High memory usage. - High load on the large object heap. - Limited by the available amount of memory (which might be considered a violation of the Python API). + Optimal CPU usage when the hash is only fetched once. ± Better than current code, but still not optimal when hash is incrementally fetched several times. 3) Optimizing with jagged arrays and lazy calculation: - High memory usage. + Improved or no impact on the large object heap (depending on the exact implementation) - Limited by the available amount of memory (which might be considered a violation of the Python API). + Optimal CPU usage when the hash is only fetched once. ± Better than current code, but still not optimal when hash is incrementally fetched several times. 4) Using the existing .NET incremental APIs + Low, constant memory usage. + No impact on the large object heap. + No limit of data length by the amount of memory. + Optimal CPU usage when the hash is only fetched once. - Breaks when hash is incrementally fetched several times (which likely is a violation of the Python API). 5) Finding or porting a different Hash implementation in C#: + Low, constant memory usage. + No impact on the large object heap. + No limit of data length by the amount of memory. + Optimal CPU usage when the hash is only fetched once. + Optimal CPU usage when the hash is incrementally fetched several times. I've a local prototype implemented for 2), but I'm not sure whether that's the best way to go... Maybe we should google for purely managed implementations of the hash codes with a sensible license... Best regards Markus Schaber CODESYS® a trademark of 3S-Smart Software Solutions GmbH Inspiring Automation Solutions ________________________________________ 3S-Smart Software Solutions GmbH Dipl.-Inf. Markus Schaber | Product Development Core Technology Memminger Str. 151 | 87439 Kempten | Germany Tel. +49-831-54031-979 | Fax +49-831-54031-50 E-Mail: m.scha...@codesys.com | Web: codesys.com | CODESYS store: store.codesys.com CODESYS forum: forum.codesys.com Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner | Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915 Von: Curt Hagenlocher [mailto:c...@hagenlocher.org] Gesendet: Dienstag, 25. Februar 2014 17:59 An: Jeff Hardy Cc: Markus Schaber; Discussion of IronPython; Emmanuel Chomarat Betreff: Re: [Ironpython-users] Hashing a directory is magnitudes slower than in cPython "Basically, there's a mismatch between what .NET provides and what Python needs for perfect compatibility." Yes. I think I remember implementing this and that's exactly the problem I ran into. I think we looked into incorporating a modified version of the BCL code directly into IronPython, but at least in those days, that was a pretty hard thing to get done. We ran into a similar issue when implementing the compression API. You could get around the problem in the client code with an "if sys.platform == 'cli'" and then use the .NET classes directly. On Tue, Feb 25, 2014 at 8:53 AM, Jeff Hardy <jdha...@gmail.com> wrote: On Tue, Feb 25, 2014 at 12:38 PM, Markus Schaber <m.scha...@codesys.com> wrote: > Hi, > > A coworker just consulted me on a performance problem of IronPython vs. > cPython. > > ... snip ... > > On a closer look, there's the additional (and IMHO much worse) problem that > the update() method seems not to work incrementally: > > private void update(IList<byte> newBytes) { > byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count]; > Array.Copy(_bytes, updatedBytes, _bytes.Length); > newBytes.CopyTo(updatedBytes, _bytes.Length); > _bytes = updatedBytes; > _hash = GetHasher().ComputeHash(_bytes); > } > > In our use-case, this means that every file which is read leads to a > reallocation and copying and recalculation of the MD5 sum of all the data > which was read until now. This is suboptimal from memory and performance > perspective. > > I'm not an expert on the .NET crypto APIs, but I guess there should be some > incremental API available there which could be exploited. http://ironpython.codeplex.com/workitem/34022 I've also CC'd Emmanuel Chomarat, who was investigating a fix for this. Unfortunately I don't think there's an easy solution based on how the .NET APIs are constructed. Quoting from Emmanuel's email to me a while back: "I am now using TransformBlock / TransformBlockFinal to compute the current hash with a linear complexity ( whereas we had before n**2) but I am still facing an issue. First we need to have a copy operator, this is not possible because we can not share the hash instance between two objects in .net, the only way to make it consistent with what python is doing is by keeping a copy of the full data in MEMORY in order to create a new instance with these data when copy is called. The second thing is that digest can be called several times in python with some new data added/updated to the hash , in C# as soon as TransformBlockFinal has been called once we can not anymore add more data to the stream. Once again I have been able to use the same previous technic but at a memory cost + computation cost if we call serveral times digest/hexdigest. I don't see any to escape this pb with MS api that does not expose internal states as the underlying md5 lib in python does." Basically, there's a mismatch between what .NET provides and what Python needs for perfect compatibility. Keeping all data in memory is not desirable, but neither is failing some operations. And I would *really* prefer not to have to reimplement all of the cryptographic hash functions Python has. One option is to default to not buffering and failing on certain operations, and offer a constructor flag that enables buffering to allow the otherwise-impossible operations. Not my favourite idea, but workable. - Jeff _______________________________________________ Ironpython-users mailing list Ironpython-users@python.org https://mail.python.org/mailman/listinfo/ironpython-users _______________________________________________ Ironpython-users mailing list Ironpython-users@python.org https://mail.python.org/mailman/listinfo/ironpython-users