steveloughran commented on pull request #2971: URL: https://github.com/apache/hadoop/pull/2971#issuecomment-1066904248
OK, I'm going to say "sorry, no" to the idea of using diff to validate JSON files, though I'll think a bit about dest file validation. JSON is there to be parsed: the bundled diagnostics and iostats change, and the file paths will vary between local, abfs and gcs. The way to validate it is to read it in and make assertions on it.

Alongside this PR, I have a private fork of google gcs which subclasses all the tests and runs them against google cloud, plus end-to-end tests through spark standalone: https://github.com/hortonworks-spark/cloud-integration

These tests verify the committer works for dataframes, and for spark sql with orc/parquet and csv:
https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/com/cloudera/spark/cloud/abfs/commit/AbfsCommitDataframeSuite.scala#L83
https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples/src/test/scala/org/apache/spark/sql/hive/orc/abfs
https://github.com/hortonworks-spark/cloud-integration/tree/master/cloud-examples/src/test/scala/org/apache/spark/sql/hive/orc/gs

These tests load and validate the success file (and its truncated list of generated files) against the filesystem:
https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/main/scala/com/cloudera/spark/cloud/s3/S3AOperations.scala#L54

This is all an evolution of the existing suites for the s3a committers, which is where the success file came from. I would rather do the detailed testing there, as those are full integration tests. They are fairly tricky to get building, however: a full compile takes an hour+, and needs to be repeated every morning (-SNAPSHOT artifacts, you see).

What I can do in the hadoop tests is add a test to load a success file, validate it against the output, and check that there are no unknown files there.

I'd love some suggestions for improving the spark ones too. It's a mix of my own tests and some I moved from the apache spark sql suites and reworked to be targetable at different filesystems. One thing I don't test there is writing data over existing files in a complex partition tree... I should do that, which I can do after this patch is in.
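To illustrate the "read it in and make assertions on it" approach, here is a minimal sketch of such a hadoop-side test helper. It is only a sketch, not code from this PR: it assumes the `_SUCCESS` file is the committer's JSON manifest, that the manifest carries a `filenames` array of created files, and that those entries resolve relative to the output directory; the real test should load it through the committer's own serializer class rather than raw Jackson.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import static org.junit.Assert.assertTrue;

public final class SuccessFileValidation {

  private SuccessFileValidation() {
  }

  /**
   * Load the _SUCCESS JSON manifest and assert that every file it lists
   * exists under the job output directory.
   * Field name "filenames" and relative-path resolution are assumptions.
   */
  public static void validateSuccessFile(FileSystem fs, Path outputDir)
      throws Exception {
    Path success = new Path(outputDir, "_SUCCESS");
    assertTrue("No _SUCCESS file at " + success, fs.exists(success));

    // parse the manifest as plain JSON; a real test would use the
    // committer's serializer to get a typed view of the data.
    JsonNode manifest;
    try (FSDataInputStream in = fs.open(success)) {
      manifest = new ObjectMapper().readTree(in);
    }

    // iterate the (possibly truncated) list of generated files and
    // verify each one is actually present in the destination filesystem.
    for (JsonNode name : manifest.path("filenames")) {
      Path created = new Path(outputDir, name.asText());
      assertTrue("Manifest lists missing file " + created, fs.exists(created));
    }
  }
}
```

A fuller version would also list the output tree and fail on files which are neither in the manifest nor expected markers, which is the "no unknown files" check mentioned above.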
