DataFrame vs Dataset dilemma: either Row parsing or no filter push-down

2018-06-18 Thread Valery Khamenya
Hi Spark gurus, I was surprised to read here: https://stackoverflow.com/questions/50129411/why-is-predicate-pushdown-not-used-in-typed-dataset-api-vs-untyped-dataframe-ap that filters are not pushed down in typed Datasets and that one should rather stick to DataFrames. But writing code for
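
A minimal sketch of the trade-off the thread describes, assuming a Parquet source at a hypothetical path: the Column-based filter is visible to Catalyst and can be pushed down to the scan, while the typed lambda is an opaque function and forces full deserialization first.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("pushdown-demo").getOrCreate()
    import spark.implicits._

    case class Event(id: Long, level: String)

    val df = spark.read.parquet("/data/events") // hypothetical path

    // Untyped Column expression: Catalyst sees the predicate and can push it
    // down to the Parquet reader (look for "PushedFilters" in the plan).
    df.filter($"level" === "ERROR").explain()

    // Typed lambda on a Dataset: an opaque Scala function, so no push-down;
    // every row is deserialized into an Event before the filter runs.
    df.as[Event].filter(_.level == "ERROR").explain()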

best practices to implement a library of custom transformations of DataFrame/Dataset

2018-06-18 Thread Valery Khamenya
Dear Spark gurus, *Question*: what approach would you recommend for shaping a library of custom transformations for DataFrames/Datasets? *Details*: e.g., consider that we need several custom transformations over Dataset/DataFrame instances, for example injecting columns, applying specially tuned row
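
One widely used shape for such a library, sketched here with made-up transformation names and an assumed inputDf: expose each step as a DataFrame => DataFrame function and chain the steps with Dataset.transform, so each one stays small and individually testable.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    object MyTransforms {
      // Inject a column (name and logic are illustrative).
      def withIngestDate(df: DataFrame): DataFrame =
        df.withColumn("ingest_date", current_date())

      // Another small, self-contained step.
      def withNormalizedName(df: DataFrame): DataFrame =
        df.withColumn("name", upper(col("name")))
    }

    // Steps compose left to right; each can be unit-tested in isolation.
    val result = inputDf
      .transform(MyTransforms.withIngestDate)
      .transform(MyTransforms.withNormalizedName)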

smarter way to "forget" DataFrame definition and stick to its values

2018-05-01 Thread Valery Khamenya
hi all, a short example before the long story:

    var accumulatedDataFrame = ... // initialize
    for (i <- 1 to 100) {
      val myTinyNewData = ... // my slowly calculated new data portion in tiny amounts
      accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData) // how to stick
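
One common answer to this pattern, sketched under the assumption of a hypothetical computeTinyBatch helper and an assumed checkpoint directory: periodically checkpoint the accumulated DataFrame (available since Spark 2.1), which materializes the current values and truncates the ever-growing union lineage.

    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // assumed location

    var accumulatedDataFrame = initialDataFrame
    for (i <- 1 to 100) {
      val myTinyNewData = computeTinyBatch(i) // hypothetical helper for the new portion
      accumulatedDataFrame = accumulatedDataFrame.union(myTinyNewData)
      if (i % 10 == 0) {
        // Eagerly writes the current rows and returns a DataFrame whose plan
        // no longer carries the accumulated chain of unions.
        accumulatedDataFrame = accumulatedDataFrame.checkpoint()
      }
    }

On Spark 2.3+, localCheckpoint() is a cheaper alternative when fault tolerance of the checkpointed data is not required.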

all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Valery Khamenya
Hi all, I am experiencing a strange thing: when Spark 2.3.0 calculations started from Zeppelin 0.7.3 have finished, the "VCores Used" value in the resource manager stays at its maximum, although nothing should be running anymore. How come? If relevant, I have experienced this issue since AWS EMR 5.13.0
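
A common explanation is that Zeppelin keeps its SparkContext (and therefore its YARN executors) alive between paragraphs, so the resource manager still counts those cores as used. A hedged sketch of one mitigation, using standard dynamic-allocation settings with illustrative values, so that idle executors are handed back to YARN after the job completes:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
      .config("spark.dynamicAllocation.minExecutors", "0")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()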