steveloughran commented on pull request #2971:
URL: https://github.com/apache/hadoop/pull/2971#issuecomment-1066904248


   OK.
   
   I'm going to say "sorry, no" to the idea of using diff to validate JSON files; we need to think a bit about dest file validation.
   
   JSON is there to be parsed; the bundled diagnostics and iostats change, and the file paths will differ between local, abfs and gcs.
   
   The way to validate it is to read it in and make assertions on it.
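   A minimal sketch of what that could look like, assuming the Jackson ObjectMapper already on Hadoop's classpath; the field names ("committer", "filenames") are assumptions based on the S3A committer's success-file schema, not necessarily what this PR writes:

```java
// Sketch: validate the _SUCCESS JSON by parsing it and asserting on the
// fields that matter, rather than diffing the raw text (which would break
// whenever diagnostics, iostatistics or filesystem paths change).
// Field names here are assumptions taken from the S3A committer's success file.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.InputStream;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

public class SuccessFileChecks {

  /** Load the _SUCCESS file under the job output dir and assert on it. */
  public static void verifySuccessFile(FileSystem fs, Path outputDir,
      String expectedCommitter) throws Exception {
    Path success = new Path(outputDir, "_SUCCESS");
    ObjectMapper mapper = new ObjectMapper();
    try (InputStream in = fs.open(success)) {
      JsonNode root = mapper.readTree(in);
      // Stable fields get exact assertions...
      assertEquals(expectedCommitter, root.path("committer").asText());
      // ...variable ones (file lists, diagnostics) only structural checks.
      assertTrue("success file lists no committed files",
          root.path("filenames").size() > 0);
    }
  }
}
```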
   
   Alongside this PR, I have a private fork of Google GCS which subclasses all the tests and runs them against Google Cloud,
   
   and end-to-end tests through Spark standalone:
   https://github.com/hortonworks-spark/cloud-integration
   
   These tests verify that the committer works for dataframes, and for Spark SQL with ORC/Parquet and CSV:
   
https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/com/cloudera/spark/cloud/abfs/commit/AbfsCommitDataframeSuite.scala#L83
   
https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples/src/test/scala/org/apache/spark/sql/hive/orc/abfs
   
https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples/src/test/scala/org/apache/spark/sql/hive/orc/gs
   
   These tests load and validate the success file (and its truncated list of generated files) against the filesystem:
   
https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/main/scala/com/cloudera/spark/cloud/s3/S3AOperations.scala#L54
   
   This is all an evolution of the existing suites for the S3A committers, which is where the success file came from.
   
   I would rather do the detailed testing in those suites, as they are full integration tests. It is fairly tricky to get them building, however; a full compile takes an hour or more and needs to be repeated every morning (-SNAPSHOT artifacts, you see).
   
   What I can do in the hadoop tests is add a test which loads a success file, validates it against the output, and checks that there are no unknown files there.
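   A rough sketch of that check, assuming the list of committed files has already been extracted from the success file (the class and method names here are illustrative, not the committer's real API); note the success file's file list can be truncated, so the "no unknown files" assertion may need the full expected list from the test itself:

```java
// Sketch: given the filenames recorded in the success file, walk the real
// output directory and check that every recorded file exists and that
// nothing unexpected is present. Comparison is by file name only, which
// sidesteps scheme/authority differences between local, abfs and gcs.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import static org.junit.Assert.assertTrue;

public class OutputDirectoryValidation {

  public static void assertOutputMatchesSuccessFile(FileSystem fs,
      Path outputDir, List<String> committedFiles) throws Exception {
    // What is actually in the destination tree.
    Set<String> found = new HashSet<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(outputDir, true);
    while (it.hasNext()) {
      found.add(it.next().getPath().getName());
    }
    // _SUCCESS itself is expected, but is not part of the committed data.
    found.remove("_SUCCESS");

    // What the success file says was committed.
    Set<String> expected = new HashSet<>();
    for (String f : committedFiles) {
      expected.add(new Path(f).getName());
    }

    // Every committed file must be present...
    for (String name : expected) {
      assertTrue("missing committed file " + name, found.contains(name));
    }
    // ...and nothing unknown may be left behind in the output tree.
    found.removeAll(expected);
    assertTrue("unexpected files in output: " + found, found.isEmpty());
  }
}
```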
   
   I'd love some suggestions for improvements to the Spark ones too. They are a mix of my own tests and some I moved from the Apache Spark SQL suites and reworked to be targetable at different filesystems. One thing I don't test there is writing data over existing files in a complex partition tree...I should do that, which I can do after this patch is in...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


