Agreed with the statement quoted below; whether one wants to do unit
tests or not, it is good practice to write code that way. But I think the
more painful and tedious task is to mock/emulate all the nodes, such as the
Spark workers/master, HDFS, the input source stream and all that. I wish
there were something that made that part easier.
>
> Basically you abstract your transformations to take in a dataframe and
> return one, then you assert on the returned df
>
+1 to this suggestion. This is why we wanted streaming and batch
dataframes to share the same API.
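To make the pattern concrete, here is a minimal sketch of that suggestion.
The adultsOnly transformation and its column names are made up for
illustration; the point is that the transformation is a pure
DataFrame-in/DataFrame-out function, so it can be tested against a plain
local[*] session with no cluster to mock:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical transformation under test: takes a DataFrame, returns one.
object Transformations {
  def adultsOnly(df: DataFrame): DataFrame =
    df.filter(col("age") >= 18)
}

object TransformationExample {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process, so there are no workers to emulate.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transformation-test")
      .getOrCreate()
    import spark.implicits._

    val input  = Seq(("alice", 34), ("bob", 12)).toDF("name", "age")
    val result = Transformations.adultsOnly(input)

    // Assert on the returned DataFrame.
    assert(result.count() == 1)
    assert(result.select("name").as[String].head() == "alice")

    spark.stop()
  }
}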
This depends on your target setup! For my open source libraries, for
example, I run Spark integration tests (in a dedicated folder alongside the
unit tests) against a local Spark master, but I also use a MiniDFS cluster
(to simulate HDFS on a node) and sometimes a MiniYARN cluster (see
https://wiki.apac
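If it helps, here is a rough sketch of that kind of integration test setup,
assuming the hadoop-hdfs test artifact (which provides MiniDFSCluster) is on
the test classpath; the path and row count are purely illustrative:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.MiniDFSCluster
import org.apache.spark.sql.SparkSession

// Sketch: spin up an in-process HDFS and run a Spark job against it.
object MiniDfsIntegrationExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val hdfs = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
    hdfs.waitClusterUp()

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("minidfs-integration-test")
      .getOrCreate()

    // Write to and read back from the simulated HDFS namenode.
    val uri  = hdfs.getURI.toString          // e.g. hdfs://127.0.0.1:<port>
    val path = s"$uri/it/numbers"
    spark.range(100).write.parquet(path)
    assert(spark.read.parquet(path).count() == 100)

    spark.stop()
    hdfs.shutdown()
  }
}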
Hey kant,
You can use Holden's spark-testing-base.
Have a look at some of the specs I wrote here to give you an idea:
https://github.com/samelamin/spark-bigquery/blob/master/src/test/scala/com/samelamin/spark/bigquery/BigQuerySchemaSpecs.scala
Basically you abstract your transformations to take in a dataframe and
return one, then you assert on the returned df.
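As a rough sketch of what such a spec can look like with spark-testing-base
(the AdultsOnlySpec name and the adultsOnly transformation are hypothetical;
DataFrameSuiteBase supplies the local spark session and the
assertDataFrameEquals helper):

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.scalatest.FunSuite

class AdultsOnlySpec extends FunSuite with DataFrameSuiteBase {

  // Hypothetical transformation under test: DataFrame in, DataFrame out.
  def adultsOnly(df: DataFrame): DataFrame = df.filter(col("age") >= 18)

  test("filters out rows with age below 18") {
    import spark.implicits._
    val input    = Seq(("alice", 34), ("bob", 12)).toDF("name", "age")
    val expected = Seq(("alice", 34)).toDF("name", "age")

    assertDataFrameEquals(expected, adultsOnly(input))
  }
}

assertDataFrameEquals compares both schema and rows, which is usually what
you want when asserting on a returned df.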