Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-06 Thread Andrew Anderson
My concern would be more that, given the proven weaknesses in MD5, do I want to 
risk that one-in-a-billion chance that the “right” bit error creeps into an 
archive without affecting the checksum, creating the illusion that the 
archive's integrity has not been violated?

-- 
Andrew Anderson, Director of Development, Library and Information Resources 
Network, Inc.
http://www.lirn.net/ | http://www.twitter.com/LIRNnotes | 
http://www.facebook.com/LIRNnotes

On Oct 2, 2014, at 18:34, Jonathan Rochkind  wrote:

> For checksums for ensuring archival integrity, are cryptographic flaws 
> relevant? I'm not sure; is part of the point of a checksum to ensure against 
> _malicious_ changes to files? I honestly don't know. (But in most systems, 
> I'd guess anyone who had access to maliciously change the file would also 
> have access to maliciously change the checksum!)
> 
> ROT13 is not suitable as a checksum for ensuring archival integrity, however, 
> because its output is no smaller than its input -- and a smaller, fixed-size 
> digest is kind of what you're looking for.
> 
> 
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Cary Gordon 
> [listu...@chillco.com]
> Sent: Thursday, October 02, 2014 5:51 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] What is the real impact of SHA-256? - Updated
> 
> +1
> 
> MD5 is little better than ROT13. At least with ROT13, you have no illusions.
> 
> We use SHA-512 for most work. We don't do finance or national security, so it 
> is a good fit for us.
> 
> Cary
> 
> On Oct 2, 2014, at 12:30 PM, Simon Spero  wrote:
> 
>> Intel Skylake processors have dedicated SHA instructions.
>> See: https://software.intel.com/en-us/articles/intel-sha-extensions
>> 
>> Using a tree hash approach (which is inherently embarrassingly parallel)
>> will leave I/O time dominant. This approach is used by Amazon Glacier - see
>> http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
>> 
>> MD5 is broken, and cannot be used for any security purposes. It cannot be
>> used for deduplication if any of the files are in the directories of
>> security researchers!
>> 
>> If security is not a concern then there are many faster hashing algorithms
>> that avoid the costs imposed by the need to defend against adversaries.
>> See SipHash, MurmurHash, CityHash, etc.
>> 
>> Simon
>> On Oct 2, 2014 11:18 AM, "Alex Duryee"  wrote:
>> 
>>> Despite some of its relative flaws, MD5 is frequently selected over SHA-256
>>> in archives as the checksum algorithm of choice. One of the primary factors
>>> here is the longer processing time required for SHA-256, though there have
>>> been no empirical studies calculating that time difference and its overall
>>> impact on checksum generation and verification in a preservation
>>> environment.
>>> 
>>> AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
>>> the real time and CPU time used by each algorithm. His newly updated white
>>> paper "What Is the Real Impact of SHA-256?" presents the results and comes
>>> to some interesting conclusions regarding the actual time difference
>>> between the two and what other factors may have a greater impact on your
>>> selection decision and file monitoring workflow. The paper can be
>>> downloaded for free at
>>> 
>>> http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
>>> .
>>> __
>>> 
>>> Alex Duryee
>>> *AVPreserve*
>>> 350 7th Ave., Suite 1605
>>> New York, NY 10001
>>> 
>>> office: 917-475-9630
>>> 
>>> http://www.avpreserve.com
>>> Facebook.com/AVPreserve <http://facebook.com/AVPreserve>
>>> twitter.com/AVPreserve
>>> 


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-04 Thread Simon Spero
There are published papers on MD5 collisions, with associated examples.

Researchers at http://isi.jhu.edu are quite likely to have read and
downloaded them.
E.g.:
http://www.forensicfocus.com/Content/pid=87/page=2/
On Oct 3, 2014 3:05 PM, "Alexander Duryee" 
wrote:

> Simon - do you have any examples of MD5 collisions in JHU's collections?
> The chance of that occurring is vanishingly small
> (http://prezi.com/zfyebvaelksh/fixity-20/), so I'm curious what produced the
> collision, and how often.
>
> On Fri, Oct 3, 2014 at 12:14 PM, Kyle Banerjee 
> wrote:
>
> > On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair  wrote:
> >
> > > Look at slide 15 here:
> > > http://www.slideshare.net/DuraSpace/sds-cwebinar-1
> > >
> > > I think we're worried about the cumulative effect over time of
> > > undetected errors (at least, I am).
> >
> >
> > This slide shows that data loss via drive fault is extremely rare. Note
> > that a bit getting flipped is usually harmless. However, I do believe
> that
> > data corruption via other avenues will be considerably more common.
> >
> > My point is that the use case for libraries is generally weak and the
> > solution is very expensive -- don't forget the authenticity checks must
> > also be done on the "good" files. As you start dealing with more and more
> > data, this system is not sustainable for the simple reason that
> maintained
> > disk space costs a fortune and network capacity is a bottleneck. It's no
> > big deal to do this on a few TB since our repositories don't have to
> worry
> > about the integrity of dynamic data, but you eventually get to a point
> > where it sucks up too many systems resources and consumes too much
> > expertise.
> >
> > Authoritative files really should be offline but if online access to
> > authoritative files is seen as an imperative, it at least makes more
> sense
> > to just do something like dump it all in Glacier and slowly refresh
> > everything you own with the authoritative copy. Or better yet, just leave the
> > stuff there and just make new derivatives when there is any reason to
> > believe the existing ones are not good.
> >
> > While I think integrity is an issue, I think other deficiencies in
> > repositories are more pressing. Except for the simplest use cases,
> getting
> > stuff in or out of them is a hopeless process even with automated
> > assistance. Metadata and maintenance aren't very good either. That you
> > still need coding skills to get popular platforms that have been in use
> for
> > many years to ingest and serve up things as simple as documents and
> images
> > speaks volumes.
> >
> > kyle
> >
>


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Alexander Duryee
Simon - do you have any examples of MD5 collisions in JHU's collections?
The chance of that occurring is vanishingly small
(http://prezi.com/zfyebvaelksh/fixity-20/), so I'm curious what produced the
collision, and how often.

On Fri, Oct 3, 2014 at 12:14 PM, Kyle Banerjee 
wrote:

> On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair  wrote:
>
> > Look at slide 15 here:
> > http://www.slideshare.net/DuraSpace/sds-cwebinar-1
> >
> > I think we're worried about the cumulative effect over time of
> > undetected errors (at least, I am).
>
>
> This slide shows that data loss via drive fault is extremely rare. Note
> that a bit getting flipped is usually harmless. However, I do believe that
> data corruption via other avenues will be considerably more common.
>
> My point is that the use case for libraries is generally weak and the
> solution is very expensive -- don't forget the authenticity checks must
> also be done on the "good" files. As you start dealing with more and more
> data, this system is not sustainable for the simple reason that maintained
> disk space costs a fortune and network capacity is a bottleneck. It's no
> big deal to do this on a few TB since our repositories don't have to worry
> about the integrity of dynamic data, but you eventually get to a point
> where it sucks up too many systems resources and consumes too much
> expertise.
>
> Authoritative files really should be offline but if online access to
> authoritative files is seen as an imperative, it at least makes more sense
> to just do something like dump it all in Glacier and slowly refresh
> everything you own with the authoritative copy. Or better yet, just leave the
> stuff there and just make new derivatives when there is any reason to
> believe the existing ones are not good.
>
> While I think integrity is an issue, I think other deficiencies in
> repositories are more pressing. Except for the simplest use cases, getting
> stuff in or out of them is a hopeless process even with automated
> assistance. Metadata and maintenance aren't very good either. That you
> still need coding skills to get popular platforms that have been in use for
> many years to ingest and serve up things as simple as documents and images
> speaks volumes.
>
> kyle
>


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Kyle Banerjee
On Fri, Oct 3, 2014 at 7:26 AM, Charles Blair  wrote:

> Look at slide 15 here:
> http://www.slideshare.net/DuraSpace/sds-cwebinar-1
>
> I think we're worried about the cumulative effect over time of
> undetected errors (at least, I am).


This slide shows that data loss via drive fault is extremely rare. Note
that a bit getting flipped is usually harmless. However, I do believe that
data corruption via other avenues will be considerably more common.

My point is that the use case for libraries is generally weak and the
solution is very expensive -- don't forget the authenticity checks must
also be done on the "good" files. As you start dealing with more and more
data, this system is not sustainable for the simple reason that maintained
disk space costs a fortune and network capacity is a bottleneck. It's no
big deal to do this on a few TB since our repositories don't have to worry
about the integrity of dynamic data, but you eventually get to a point
where it sucks up too many systems resources and consumes too much
expertise.

Authoritative files really should be offline but if online access to
authoritative files is seen as an imperative, it at least makes more sense
to just do something like dump it all in Glacier and slowly refresh
everything you own with the authoritative copy. Or better yet, just leave the
stuff there and just make new derivatives when there is any reason to
believe the existing ones are not good.

While I think integrity is an issue, I think other deficiencies in
repositories are more pressing. Except for the simplest use cases, getting
stuff in or out of them is a hopeless process even with automated
assistance. Metadata and maintenance aren't very good either. That you
still need coding skills to get popular platforms that have been in use for
many years to ingest and serve up things as simple as documents and images
speaks volumes.

kyle


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Cornel Darden Jr.
Hello,

Also, ideally, you would make sure the distribution path of the checksum is 
separate from the data itself. Otherwise, if someone gains unauthorized access 
to the data, they may also have unauthorized access to the checksum file. It's 
also unwise to store the checksum in a file header (unless the file itself is 
encrypted): if the format is reverse engineered, an attacker could modify the 
stored checksum so that the verification passes.

But the checksum operation answers only a binary question: is the data the 
same or different?


You must build additional layers on top of it to answer questions such as 
"With what credentials was it changed?", "How did the file change?", "When was 
it changed?", and so on.

Thanks,

Cornel Darden Jr.  
MSLIS
Library Department Chair
South Suburban College
7087052945

"Our Mission is to Serve our Students and the Community through lifelong 
learning."

Sent from my iPhone

> On Oct 3, 2014, at 10:21 AM, Al Matthews  wrote:
> 
> I’m not sure I understand the prior comment about compression.
> 
> I agree that hashing workflows are neither simple nor of themselves secure. I 
> agree with the implication that they can explode in scope.
> 
> From what I can tell, the state of hashing verification tools reflects 
> substantial confusion over their utility and purpose. In some ways it’s a 
> quixotic attempt to re-invent LOCKSS or equivalent. In other ways it’s 
> perfectly sensible.
> 
> I think that the move to evaluate SHA-256 reflects some clear concern over 
> tampering (as does the history of LOCKSS itself). This is not to say 
> that MD5 collisions (much less, substitutions) are mathematically trivial, 
> but rather, that they are now commonly contemplated.
> 
> Compare Bruce Schneier’s comments about abandoning SHA-1 entirely, or 
> computation’s reliance on Cyclic Redundancy Checks. In many ways it’s an 
> InfoSec consideration dropped in the middle of archival or library workflow 
> specification.
> 
> --
> Al Matthews
> Software Developer, Digital Services Unit
> Atlanta University Center, Robert W. Woodruff Library
> email: amatth...@auctr.edu; office: 1 404 978 2057
> 
> 
> From: Charles Blair <c...@uchicago.edu>
> Organization: The University of Chicago Library
> Reply-To: c...@uchicago.edu
> Date: Friday, October 3, 2014 at 10:26 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] What is the real impact of SHA-256? - Updated
> 
> Look at slide 15 here:
> http://www.slideshare.net/DuraSpace/sds-cwebinar-1
> 
> I think we're worried about the cumulative effect over time of
> undetected errors (at least, I am).
> 
> On Fri, Oct 03, 2014 at 05:37:14AM -0700, Kyle Banerjee wrote:
> On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <sesunc...@gmail.com> wrote:
> 
>> Checksums can be kept separate (tripwire style).
>> For JHU archiving, the use of MD5 would give false positives for duplicate
>> detection.
>> 
>> There is no reason to use a bad cryptographic hash. Use a fast hash, or use
>> a safe hash.
> 
> I have always been puzzled why so much energy is expended on bit integrity
> in the library and archival communities. Hashing does not accommodate
> modification of internal metadata or compression, which do not compromise
> integrity. And if people who can access the files can also access the
> hashes, there is no contribution to security. Also, wholesale hashing of
> repositories scales poorly. My guess is that the biggest threats are staff
> error or rogue processes (i.e. bad programming). Any malicious
> destruction/modification is likely to be an inside job.
> 
> In reality, using file size alone is probably sufficient for detecting
> changed files -- if dup detection is desired, then hashing the few that dup
> out can be performed. Though if dups are an actual issue, it reflects
> problems elsewhere. Thrashing disks and cooking the CPU for the purposes
> libraries use hashes for seems way overkill, especially given that basic
> interaction with repositories for depositors, maintainers, and users is
> still in a very primitive state.
> 
> kyle
> 
> 
> --
> Charles Blair, Director, Digital Library Development Center, University of 
> Chicago Library
> 1 773 702 8459 | c...@uchicago.edu<mailto:c...@uchicago.edu> | 
> http://www.lib.uchicago.edu/~chas/
> 
> 


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Nathan Tallman
Bit integrity is crucial for libraries and archives, especially government
archives. Authenticity is a key concept for born-digital archives. We need
to be able to say definitively that a file has not changed since it was
received from the donor or organizational unit, for accountability and
transparency. The authenticity trail is needed as evidence in court and is
in some cases mandated by the government. And of course fixity checking
also helps detect bit corruption, another important part of digital
preservation.

Regarding "if someone has access to the file, they have access to the
checksum": that's not always the whole picture. Best practices for digital
preservation recommend keeping copies in multiple places, such as a dark
archive, and systematically running checksums on all the copies and
comparing them. Someone might be able to gain access to one system, but it
is much more unlikely that they'll get access to all systems. So, if there's
a fixity change in one place and not the others, it is flagged for
investigation and comparison.
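
A minimal sketch of that cross-copy comparison, assuming each storage location
can independently report a digest for the same object (the location names and
digest values are hypothetical):

from collections import Counter

def flag_fixity_mismatches(digests):
    # digests: {location: hex_digest}, each computed independently per copy.
    counts = Counter(digests.values())
    if len(counts) == 1:
        return []  # all copies agree; nothing to investigate
    majority_digest, _ = counts.most_common(1)[0]
    # Copies that disagree with the majority are flagged for investigation.
    return [loc for loc, d in digests.items() if d != majority_digest]

suspects = flag_fixity_mismatches(
    {"local": "9f86d081...", "dark-archive": "9f86d081...", "cloud": "ab12cd34..."})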

Nathan

On Fri, Oct 3, 2014 at 8:37 AM, Kyle Banerjee 
wrote:

> On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero  wrote:
>
> > Checksums can be kept separate (tripwire style).
> > For JHU archiving, the use of MD5 would give false positives for
> duplicate
> > detection.
> >
> > There is no reason to use a bad cryptographic hash. Use a fast hash, or
> use
> > a safe hash.
> >
>
> I have always been puzzled why so much energy is expended on bit integrity
> in the library and archival communities. Hashing does not accommodate
> modification of internal metadata or compression, which do not compromise
> integrity. And if people who can access the files can also access the
> hashes, there is no contribution to security. Also, wholesale hashing of
> repositories scales poorly. My guess is that the biggest threats are staff
> error or rogue processes (i.e. bad programming). Any malicious
> destruction/modification is likely to be an inside job.
>
> In reality, using file size alone is probably sufficient for detecting
> changed files -- if dup detection is desired, then hashing the few that dup
> out can be performed. Though if dups are an actual issue, it reflects
> problems elsewhere. Thrashing disks and cooking the CPU for the purposes
> libraries use hashes for seems way overkill, especially given that basic
> interaction with repositories for depositors, maintainers, and users is
> still in a very primitive state.
>
> kyle
>


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Al Matthews
I’m not sure I understand the prior comment about compression.

I agree that hashing workflows are neither simple nor of themselves secure. I 
agree with the implication that they can explode in scope.

From what I can tell, the state of hashing verification tools reflects 
substantial confusion over their utility and purpose. In some ways it’s a 
quixotic attempt to re-invent LOCKSS or equivalent. In other ways it’s 
perfectly sensible.

I think that the move to evaluate SHA-256 reflects some clear concern over 
tampering (as does the history of LOCKSS itself). This is not to say that 
MD5 collisions (much less, substitutions) are mathematically trivial, but 
rather, that they are now commonly contemplated.

Compare Bruce Schneier’s comments about abandoning SHA-1 entirely, or 
computation’s reliance on Cyclic Redundancy Checks. In many ways it’s an 
InfoSec consideration dropped in the middle of archival or library workflow 
specification.

--
Al Matthews
Software Developer, Digital Services Unit
Atlanta University Center, Robert W. Woodruff Library
email: amatth...@auctr.edu; office: 1 404 978 2057


From: Charles Blair <c...@uchicago.edu>
Organization: The University of Chicago Library
Reply-To: c...@uchicago.edu
Date: Friday, October 3, 2014 at 10:26 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

Look at slide 15 here:
http://www.slideshare.net/DuraSpace/sds-cwebinar-1

I think we're worried about the cumulative effect over time of
undetected errors (at least, I am).

On Fri, Oct 03, 2014 at 05:37:14AM -0700, Kyle Banerjee wrote:
On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero <sesunc...@gmail.com> wrote:

> Checksums can be kept separate (tripwire style).
> For JHU archiving, the use of MD5 would give false positives for duplicate
> detection.
>
> There is no reason to use a bad cryptographic hash. Use a fast hash, or use
> a safe hash.
>

I have always been puzzled why so much energy is expended on bit integrity
in the library and archival communities. Hashing does not accommodate
modification of internal metadata or compression, which do not compromise
integrity. And if people who can access the files can also access the
hashes, there is no contribution to security. Also, wholesale hashing of
repositories scales poorly. My guess is that the biggest threats are staff
error or rogue processes (i.e. bad programming). Any malicious
destruction/modification is likely to be an inside job.

In reality, using file size alone is probably sufficient for detecting
changed files -- if dup detection is desired, then hashing the few that dup
out can be performed. Though if dups are an actual issue, it reflects
problems elsewhere. Thrashing disks and cooking the CPU for the purposes
libraries use hashes for seems way overkill, especially given that basic
interaction with repositories for depositors, maintainers, and users is
still in a very primitive state.

kyle


--
Charles Blair, Director, Digital Library Development Center, University of 
Chicago Library
1 773 702 8459 | c...@uchicago.edu<mailto:c...@uchicago.edu> | 
http://www.lib.uchicago.edu/~chas/



Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Cornel Darden Jr.
Hello,

A checksum function can verify only data integrity -- that is, only whether 
the data matches the expected values (and even this is not perfect). It cannot 
determine whether a change was malicious: the change could come from a 
malicious attack or from a simple write or transmission error.

Thanks,

Cornel Darden Jr.  
MSLIS
Library Department Chair
South Suburban College
7087052945

"Our Mission is to Serve our Students and the Community through lifelong 
learning."

Sent from my iPhone

> On Oct 2, 2014, at 5:34 PM, Jonathan Rochkind  wrote:
> 
> For checksums for ensuring archival integrity, are cryptographic flaws 
> relevant? I'm not sure; is part of the point of a checksum to ensure against 
> _malicious_ changes to files? I honestly don't know. (But in most systems, 
> I'd guess anyone who had access to maliciously change the file would also 
> have access to maliciously change the checksum!)
> 
> ROT13 is not suitable as a checksum for ensuring archival integrity, however, 
> because its output is no smaller than its input -- and a smaller, fixed-size 
> digest is kind of what you're looking for.
> 
> 
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Cary Gordon 
> [listu...@chillco.com]
> Sent: Thursday, October 02, 2014 5:51 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] What is the real impact of SHA-256? - Updated
> 
> +1
> 
> MD5 is little better than ROT13. At least with ROT13, you have no illusions.
> 
> We use SHA-512 for most work. We don't do finance or national security, so it 
> is a good fit for us.
> 
> Cary
> 
>> On Oct 2, 2014, at 12:30 PM, Simon Spero  wrote:
>> 
>> Intel Skylake processors have dedicated SHA instructions.
>> See: https://software.intel.com/en-us/articles/intel-sha-extensions
>> 
>> Using a tree hash approach (which is inherently embarrassingly parallel)
>> will leave I/O time dominant. This approach is used by Amazon Glacier - see
>> http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
>> 
>> MD5 is broken, and cannot be used for any security purposes. It cannot be
>> used for deduplication if any of the files are in the directories of
>> security researchers!
>> 
>> If security is not a concern then there are many faster hashing algorithms
>> that avoid the costs imposed by the need to defend against adversaries.
>> See SipHash, MurmurHash, CityHash, etc.
>> 
>> Simon
>>> On Oct 2, 2014 11:18 AM, "Alex Duryee"  wrote:
>>> 
>>> Despite some of its relative flaws, MD5 is frequently selected over SHA-256
>>> in archives as the checksum algorithm of choice. One of the primary factors
>>> here is the longer processing time required for SHA-256, though there have
>>> been no empirical studies calculating that time difference and its overall
>>> impact on checksum generation and verification in a preservation
>>> environment.
>>> 
>>> AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
>>> the real time and CPU time used by each algorithm. His newly updated white
>>> paper "What Is the Real Impact of SHA-256?" presents the results and comes
>>> to some interesting conclusions regarding the actual time difference
>>> between the two and what other factors may have a greater impact on your
>>> selection decision and file monitoring workflow. The paper can be
>>> downloaded for free at
>>> 
>>> http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
>>> .
>>> __
>>> 
>>> Alex Duryee
>>> *AVPreserve*
>>> 350 7th Ave., Suite 1605
>>> New York, NY 10001
>>> 
>>> office: 917-475-9630
>>> 
>>> http://www.avpreserve.com
>>> Facebook.com/AVPreserve <http://facebook.com/AVPreserve>
>>> twitter.com/AVPreserve
>>> 


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Charles Blair
Look at slide 15 here:
http://www.slideshare.net/DuraSpace/sds-cwebinar-1

I think we're worried about the cumulative effect over time of
undetected errors (at least, I am).

On Fri, Oct 03, 2014 at 05:37:14AM -0700, Kyle Banerjee wrote:
> On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero  wrote:
> 
> > Checksums can be kept separate (tripwire style).
> > For JHU archiving, the use of MD5 would give false positives for duplicate
> > detection.
> >
> > There is no reason to use a bad cryptographic hash. Use a fast hash, or use
> > a safe hash.
> >
> 
> I have always been puzzled why so much energy is expended on bit integrity
> in the library and archival communities. Hashing does not accommodate
> modification of internal metadata or compression, which do not compromise
> integrity. And if people who can access the files can also access the
> hashes, there is no contribution to security. Also, wholesale hashing of
> repositories scales poorly. My guess is that the biggest threats are staff
> error or rogue processes (i.e. bad programming). Any malicious
> destruction/modification is likely to be an inside job.
> 
> In reality, using file size alone is probably sufficient for detecting
> changed files -- if dup detection is desired, then hashing the few that dup
> out can be performed. Though if dups are an actual issue, it reflects
> problems elsewhere. Thrashing disks and cooking the CPU for the purposes
> libraries use hashes for seems way overkill, especially given that basic
> interaction with repositories for depositors, maintainers, and users is
> still in a very primitive state.
> 
> kyle
> 

-- 
Charles Blair, Director, Digital Library Development Center, University of 
Chicago Library
1 773 702 8459 | c...@uchicago.edu | http://www.lib.uchicago.edu/~chas/


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-03 Thread Kyle Banerjee
On Thu, Oct 2, 2014 at 3:47 PM, Simon Spero  wrote:

> Checksums can be kept separate (tripwire style).
> For JHU archiving, the use of MD5 would give false positives for duplicate
> detection.
>
> There is no reason to use a bad cryptographic hash. Use a fast hash, or use
> a safe hash.
>

I have always been puzzled why so much energy is expended on bit integrity
in the library and archival communities. Hashing does not accommodate
modification of internal metadata or compression, which do not compromise
integrity. And if people who can access the files can also access the
hashes, there is no contribution to security. Also, wholesale hashing of
repositories scales poorly. My guess is that the biggest threats are staff
error or rogue processes (i.e. bad programming). Any malicious
destruction/modification is likely to be an inside job.

In reality, using file size alone is probably sufficient for detecting
changed files -- if dup detection is desired, then hashing the few that dup
out can be performed. Though if dups are an actual issue, it reflects
problems elsewhere. Thrashing disks and cooking the CPU for the purposes
libraries use hashes for seems way overkill, especially given that basic
interaction with repositories for depositors, maintainers, and users is
still in a very primitive state.
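
A minimal sketch of that approach for dup detection -- group by file size
first, then hash only the files whose sizes collide (the paths are
hypothetical):

import hashlib
import os
from collections import defaultdict

def possible_dups(paths):
    # Only files sharing a size can be duplicates, so the expensive
    # hashing is confined to the few size collisions.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    by_digest = defaultdict(list)
    for group in (g for g in by_size.values() if len(g) > 1):
        for p in group:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    h.update(chunk)
            by_digest[h.hexdigest()].append(p)
    return {d: g for d, g in by_digest.items() if len(g) > 1}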

kyle


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-02 Thread Simon Spero
Checksums can be kept separate (tripwire style).
For JHU archiving, the use of MD5 would give false positives for duplicate
detection.

There is no reason to use a bad cryptographic hash. Use a fast hash, or use
a safe hash.

Simon
On Oct 2, 2014 6:34 PM, "Jonathan Rochkind"  wrote:

> For checksums for ensuring archival integrity, are cryptographic flaws
> relevant? I'm not sure; is part of the point of a checksum to ensure
> against _malicious_ changes to files? I honestly don't know. (But in most
> systems, I'd guess anyone who had access to maliciously change the file
> would also have access to maliciously change the checksum!)
>
> ROT13 is not suitable as a checksum for ensuring archival integrity,
> however, because its output is no smaller than its input -- and a smaller,
> fixed-size digest is kind of what you're looking for.
>
> 
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Cary
> Gordon [listu...@chillco.com]
> Sent: Thursday, October 02, 2014 5:51 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] What is the real impact of SHA-256? - Updated
>
> +1
>
> MD5 is little better than ROT13. At least with ROT13, you have no
> illusions.
>
> We use SHA-512 for most work. We don't do finance or national security, so
> it is a good fit for us.
>
> Cary
>
> On Oct 2, 2014, at 12:30 PM, Simon Spero  wrote:
>
> > Intel Skylake processors have dedicated SHA instructions.
> > See: https://software.intel.com/en-us/articles/intel-sha-extensions
> >
> > Using a tree hash approach (which is inherently embarrassingly parallel)
> > will leave I/O time dominant. This approach is used by Amazon Glacier - see
> > http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
> >
> > MD5 is broken, and cannot be used for any security purposes. It cannot be
> > used for deduplication if any of the files are in the directories of
> > security researchers!
> >
> > If security is not a concern then there are many faster hashing algorithms
> > that avoid the costs imposed by the need to defend against adversaries.
> > See SipHash, MurmurHash, CityHash, etc.
> >
> > Simon
> > On Oct 2, 2014 11:18 AM, "Alex Duryee"  wrote:
> >
> >> Despite some of its relative flaws, MD5 is frequently selected over SHA-256
> >> in archives as the checksum algorithm of choice. One of the primary factors
> >> here is the longer processing time required for SHA-256, though there have
> >> been no empirical studies calculating that time difference and its overall
> >> impact on checksum generation and verification in a preservation
> >> environment.
> >>
> >> AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
> >> the real time and CPU time used by each algorithm. His newly updated white
> >> paper "What Is the Real Impact of SHA-256?" presents the results and comes
> >> to some interesting conclusions regarding the actual time difference
> >> between the two and what other factors may have a greater impact on your
> >> selection decision and file monitoring workflow. The paper can be
> >> downloaded for free at
> >>
> >> http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
> >> .
> >> __
> >>
> >> Alex Duryee
> >> *AVPreserve*
> >> 350 7th Ave., Suite 1605
> >> New York, NY 10001
> >>
> >> office: 917-475-9630
> >>
> >> http://www.avpreserve.com
> >> Facebook.com/AVPreserve <http://facebook.com/AVPreserve>
> >> twitter.com/AVPreserve
> >>
>


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-02 Thread Simon Spero
SHA-256 is authorized up to SECRET.
SHA-384+ is required for TOP SECRET.

Algorithms approved for more stringent requirements such as FOUO-SCI (SES
Covering-up Incompetence) have not been revealed, though CARPA has funded
research into plausible repudiability.


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-02 Thread Jonathan Rochkind
For checksums for ensuring archival integrity, are cryptographic flaws 
relevant? I'm not sure; is part of the point of a checksum to ensure against 
_malicious_ changes to files? I honestly don't know. (But in most systems, I'd 
guess anyone who had access to maliciously change the file would also have 
access to maliciously change the checksum!)

ROT13 is not suitable as a checksum for ensuring archival integrity, however, 
because its output is no smaller than its input -- and a smaller, fixed-size 
digest is kind of what you're looking for.
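
To make the size point concrete, a quick illustration in Python: a
cryptographic digest is a short, fixed-size summary regardless of input size,
whereas ROT13 output is exactly as long as the input (the sample data is
arbitrary):

import codecs
import hashlib

data = b"x" * 10**7  # stand-in for a ~10 MB file
print(len(hashlib.md5(data).hexdigest()))     # 32 hex characters, always
print(len(hashlib.sha256(data).hexdigest()))  # 64 hex characters, always
print(len(codecs.encode("x" * 10**7, "rot13")))  # 10000000 -- same as the input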


From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Cary Gordon 
[listu...@chillco.com]
Sent: Thursday, October 02, 2014 5:51 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

+1

MD5 is little better than ROT13. At least with ROT13, you have no illusions.

We use SHA-512 for most work. We don't do finance or national security, so it 
is a good fit for us.

Cary

On Oct 2, 2014, at 12:30 PM, Simon Spero  wrote:

> Intel Skylake processors have dedicated SHA instructions.
> See: https://software.intel.com/en-us/articles/intel-sha-extensions
>
> Using a tree hash approach (which is inherently embarrassingly parallel)
> will leave I/O time dominant. This approach is used by Amazon Glacier - see
> http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
>
> MD5 is broken, and cannot be used for any security purposes. It cannot be
> used for deduplication if any of the files are in the directories of
> security researchers!
>
> If security is not a concern then there are many faster hashing algorithms
> that avoid the costs imposed by the need to defend against adversaries.
> See SipHash, MurmurHash, CityHash, etc.
>
> Simon
> On Oct 2, 2014 11:18 AM, "Alex Duryee"  wrote:
>
>> Despite some of its relative flaws, MD5 is frequently selected over SHA-256
>> in archives as the checksum algorithm of choice. One of the primary factors
>> here is the longer processing time required for SHA-256, though there have
>> been no empirical studies calculating that time difference and its overall
>> impact on checksum generation and verification in a preservation
>> environment.
>>
>> AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
>> the real time and CPU time used by each algorithm. His newly updated white
>> paper "What Is the Real Impact of SHA-256?" presents the results and comes
>> to some interesting conclusions regarding the actual time difference
>> between the two and what other factors may have a greater impact on your
>> selection decision and file monitoring workflow. The paper can be
>> downloaded for free at
>>
>> http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
>> .
>> __
>>
>> Alex Duryee
>> *AVPreserve*
>> 350 7th Ave., Suite 1605
>> New York, NY 10001
>>
>> office: 917-475-9630
>>
>> http://www.avpreserve.com
>> Facebook.com/AVPreserve <http://facebook.com/AVPreserve>
>> twitter.com/AVPreserve
>>


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-02 Thread Cary Gordon
+1

MD5 is little better than ROT13. At least with ROT13, you have no illusions.

We use SHA-512 for most work. We don't do finance or national security, so it 
is a good fit for us.

Cary

On Oct 2, 2014, at 12:30 PM, Simon Spero  wrote:

> Intel Skylake processors have dedicated SHA instructions.
> See: https://software.intel.com/en-us/articles/intel-sha-extensions
> 
> Using a tree hash approach (which is inherently embarrassingly parallel)
> will leave I/O time dominant. This approach is used by Amazon Glacier - see
> http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
> 
> MD5 is broken, and cannot be used for any security purposes. It cannot be
> used for deduplication if any of the files are in the directories of
> security researchers!
> 
> If security is not a concern then there are many faster hashing algorithms
> that avoid the costs imposed by the need to defend against adversaries.
> See SipHash, MurmurHash, CityHash, etc.
> 
> Simon
> On Oct 2, 2014 11:18 AM, "Alex Duryee"  wrote:
> 
>> Despite some of its relative flaws, MD5 is frequently selected over SHA-256
>> in archives as the checksum algorithm of choice. One of the primary factors
>> here is the longer processing time required for SHA-256, though there have
>> been no empirical studies calculating that time difference and its overall
>> impact on checksum generation and verification in a preservation
>> environment.
>> 
>> AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
>> the real time and CPU time used by each algorithm. His newly updated white
>> paper "What Is the Real Impact of SHA-256?" presents the results and comes
>> to some interesting conclusions regarding the actual time difference
>> between the two and what other factors may have a greater impact on your
>> selection decision and file monitoring workflow. The paper can be
>> downloaded for free at
>> 
>> http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
>> .
>> __
>> 
>> Alex Duryee
>> *AVPreserve*
>> 350 7th Ave., Suite 1605
>> New York, NY 10001
>> 
>> office: 917-475-9630
>> 
>> http://www.avpreserve.com
>> Facebook.com/AVPreserve 
>> twitter.com/AVPreserve
>> 


Re: [CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-02 Thread Simon Spero
Intel Skylake processors have dedicated SHA instructions.
See: https://software.intel.com/en-us/articles/intel-sha-extensions

Using a tree hash approach (which is inherently embarrassingly parallel)
will leave I/O time dominant. This approach is used by Amazon Glacier - see
http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
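
A minimal serial sketch of the Glacier-style SHA-256 tree hash described on
the linked AWS page -- 1 MiB leaves, pairs of digests combined upward, an odd
leftover promoted unchanged; the independent leaf hashes are where the
parallelism would go:

import hashlib

CHUNK = 1024 * 1024  # 1 MiB leaves, per the Glacier checksum spec

def tree_hash(path):
    # Leaf level: SHA-256 over each 1 MiB chunk of the file.
    with open(path, "rb") as f:
        level = [hashlib.sha256(c).digest()
                 for c in iter(lambda: f.read(CHUNK), b"")]
    if not level:
        level = [hashlib.sha256(b"").digest()]
    # Combine pairwise until a single root digest remains.
    while len(level) > 1:
        nxt = [hashlib.sha256(level[i] + level[i + 1]).digest()
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd one out carries up unchanged
        level = nxt
    return level[0].hex()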

MD5 is broken, and cannot be used for any security purposes. It cannot be
used for deduplication if any of the files are in the directories of
security researchers!

If security is not a concern then there are many faster hashing algorithms
that avoid the costs imposed by the need to defend against adversaries.
See SipHash, MurmurHash, CityHash, etc.

Simon
On Oct 2, 2014 11:18 AM, "Alex Duryee"  wrote:

> Despite some of its relative flaws, MD5 is frequently selected over SHA-256
> in archives as the checksum algorithm of choice. One of the primary factors
> here is the longer processing time required for SHA-256, though there have
> been no empirical studies calculating that time difference and its overall
> impact on checksum generation and verification in a preservation
> environment.
>
> AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
> the real time and CPU time used by each algorithm. His newly updated white
> paper "What Is the Real Impact of SHA-256?" presents the results and comes
> to some interesting conclusions regarding the actual time difference
> between the two and what other factors may have a greater impact on your
> selection decision and file monitoring workflow. The paper can be
> downloaded for free at
>
> http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
> .
> __
>
> Alex Duryee
> *AVPreserve*
> 350 7th Ave., Suite 1605
> New York, NY 10001
>
> office: 917-475-9630
>
> http://www.avpreserve.com
> Facebook.com/AVPreserve 
> twitter.com/AVPreserve
>


[CODE4LIB] What is the real impact of SHA-256? - Updated

2014-10-02 Thread Alex Duryee
Despite some of its relative flaws, MD5 is frequently selected over SHA-256
in archives as the checksum algorithm of choice. One of the primary factors
here is the longer processing time required for SHA-256, though there have
been no empirical studies calculating that time difference and its overall
impact on checksum generation and verification in a preservation
environment.

AVPreserve Consultant Alex Duryee recently ran a series of tests comparing
the real time and CPU time used by each algorithm. His newly updated white
paper "What Is the Real Impact of SHA-256?" presents the results and comes
to some interesting conclusions regarding the actual time difference
between the two and what other factors may have a greater impact on your
selection decision and file monitoring workflow. The paper can be
downloaded for free at
http://www.avpreserve.com/papers-and-presentations/whats-the-real-impact-of-sha-256/
.
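
Not the paper's methodology -- just a minimal sketch of how such a real-time
vs. CPU-time comparison can be run with Python's standard library (the test
file path is hypothetical):

import hashlib
import time

def time_digest(path, algorithm):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return time.perf_counter() - wall0, time.process_time() - cpu0

for algorithm in ("md5", "sha256"):
    real, cpu = time_digest("samples/master.mov", algorithm)
    print("%s: real %.2fs, cpu %.2fs" % (algorithm, real, cpu))
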
__

Alex Duryee
*AVPreserve*
350 7th Ave., Suite 1605
New York, NY 10001

office: 917-475-9630

http://www.avpreserve.com
Facebook.com/AVPreserve 
twitter.com/AVPreserve