jteagles commented on pull request #125: URL: https://github.com/apache/tez/pull/125#issuecomment-896294319
@belugabehr, In general, it could be a good change. Couple of thing on my mind about this. I have been thinking about replacing apache base64 due to performance after reading http://java-performance.info/base64-encoding-and-decoding-performance/ and seeing native java8 performance. However, as Hadoop depends directly on commons-codec, thing won't truly rid tez of the transitive dependency, only the direct dependency. It's really important to switch from apache to java.util.Base64 in the backwards compatible way The org.apache.commons.codec.binary.Base64 encoder says it returns an RFC 2045 compliant encoder. https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html java.util.Base64.getEncoder() returns an RFC 4648 compliant encoder, which uses the same alphabet an RFC 2045, but without line lengths and line endings. https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html I haven't checked this with code, but they will need to be exactly the same as, the shuffle handler could be running a different version of tez than the job. So they have to be backwards compatible. I thought Base64.getMimeEncoder() will give a RFC 2045 compliant encoder in java.util.Base64 (76 line lengths with CRLF endings), but they do not match ``` java @Test public void testCodecEquivalence() throws Exception { byte[] bytes = new byte[100]; String enc1 = Base64.getEncoder().encodeToString(bytes); String enc2 = new String(org.apache.commons.codec.binary.Base64.encodeBase64(bytes)); String enc3 = Base64.getMimeEncoder().encodeToString(bytes); System.out.println("enc1 = <" + enc1 + ">"); System.out.println("enc2 = <" + enc2 + ">"); System.out.println("enc3 = <" + enc3 + ">"); System.out.println(enc1.equals(enc2)); System.out.println(enc1.equals(enc3)); } ``` ``` true false ``` So perhaps line lengths are skipped with apache and makes it RFC 4648 compatible. As for the sha384 bit, that's more complicated. I would rather not add more dependency on guava until we are using a shaded version as that library has not shown very good API stability and causes uses issues. That being said, if you look at how the sha384 is being used, it is very much overkill. In DAPAppMaster.isSameFile, two files are being read fully to compute sha384 and then SHAs are compared to see if they are equal. Same for DAG.verifyLocalResources. Perhaps better would be just to walk both file streams and stop when they differ. It will give an earlier exit for failed case. Sha384 seems just an added complication. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
