jteagles commented on pull request #125:
URL: https://github.com/apache/tez/pull/125#issuecomment-896294319


   @belugabehr, in general this could be a good change. A couple of things on 
my mind about it.
   
   I have been thinking about replacing the Apache Base64 implementation for 
performance reasons after reading
   http://java-performance.info/base64-encoding-and-decoding-performance/
   
   and seeing how well the native Java 8 implementation performs.
   
   However, as Hadoop depends directly on commons-codec, this won't truly rid 
Tez of the transitive dependency, only the direct dependency.
   
   It's really important to switch from the Apache encoder to java.util.Base64 
in a backwards-compatible way.
   The org.apache.commons.codec.binary.Base64 documentation says it provides an 
RFC 2045 compliant encoder.
   
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64.html
   
   java.util.Base64.getEncoder() returns an RFC 4648 compliant encoder, which 
uses the same alphabet as RFC 2045 but without line lengths and line endings.
   https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html
   
   I hadn't checked this with code, but the two encoders will need to produce 
exactly the same output, since the shuffle handler could be running a different 
version of Tez than the job. So they have to be backwards compatible.
   
   I thought Base64.getMimeEncoder() would give an RFC 2045 compliant encoder 
in java.util.Base64 (76-character lines with CRLF endings), but the outputs do 
not match:
   
   ``` java
     @Test
     public void testCodecEquivalence() throws Exception {
       byte[] bytes = new byte[100];
       String enc1 = Base64.getEncoder().encodeToString(bytes);
       String enc2 = new String(org.apache.commons.codec.binary.Base64.encodeBase64(bytes));
       String enc3 = Base64.getMimeEncoder().encodeToString(bytes);
       System.out.println("enc1 = <" + enc1 + ">");
       System.out.println("enc2 = <" + enc2 + ">");
       System.out.println("enc3 = <" + enc3 + ">");
       System.out.println(enc1.equals(enc2));
       System.out.println(enc1.equals(enc3));
     }
   ```
   ```
   true
   false
   ```
   So perhaps the Apache encodeBase64 skips the line breaks, which makes its 
output RFC 4648 compatible.
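   To see where the mismatch comes from without involving commons-codec at 
all, here is a minimal sketch using only java.util.Base64 (100 input bytes 
encode to 136 Base64 characters, which is past the 76-character MIME line 
limit, so the MIME encoder has to insert a CRLF while the basic encoder does 
not):

   ``` java
   import java.util.Base64;

   public class CodecLineBreaks {
     public static void main(String[] args) {
       byte[] bytes = new byte[100]; // 100 bytes -> 136 Base64 chars
       String basic = Base64.getEncoder().encodeToString(bytes);
       String mime = Base64.getMimeEncoder().encodeToString(bytes);
       // Basic RFC 4648 encoder emits one unbroken line
       System.out.println("basic contains CRLF: " + basic.contains("\r\n"));
       // MIME encoder chunks at 76 chars with CRLF, hence the mismatch above
       System.out.println("mime contains CRLF:  " + mime.contains("\r\n"));
     }
   }
   ```

   With inputs of 57 bytes or fewer (76 output characters), the two would 
happen to agree, which is why short keys can mask the difference.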
   
   
   As for the SHA-384 bit, that's more complicated. I would rather not add 
more dependencies on Guava until we are using a shaded version, as that 
library has not shown very good API stability and causes usage issues. That 
being said, if you look at how the SHA-384 is being used, it is very much 
overkill. In DAGAppMaster.isSameFile, two files are read fully to compute 
SHA-384 digests, and then the digests are compared to see if the files are 
equal. Same for DAG.verifyLocalResources. It would perhaps be better to just 
walk both file streams and stop when they differ. That gives an earlier exit 
in the failure case. SHA-384 seems like just an added complication. What do 
you think?
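   Roughly what I have in mind (isSameFile here is a hypothetical standalone 
helper for illustration, not the actual Tez method signature):

   ``` java
   import java.io.BufferedInputStream;
   import java.io.IOException;
   import java.io.InputStream;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class FileCompare {
     /** Compares two files byte by byte, returning early on the first mismatch. */
     static boolean isSameFile(Path a, Path b) throws IOException {
       if (Files.size(a) != Files.size(b)) {
         return false; // cheap early exit before reading anything
       }
       try (InputStream in1 = new BufferedInputStream(Files.newInputStream(a));
            InputStream in2 = new BufferedInputStream(Files.newInputStream(b))) {
         int c1;
         while ((c1 = in1.read()) != -1) {
           if (c1 != in2.read()) {
             return false; // stop at the first differing byte
           }
         }
         return true;
       }
     }
   }
   ```

   On Java 12+ this is essentially Files.mismatch(a, b) == -1, but a hand-rolled 
loop like the above works on the Java 8 baseline.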
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
