it's Spark 1.5.1
the DataFrame has just 2 columns, both strings
a SQL query would probably be more efficient, but it doesn't fit our purpose
(we are doing a lot more work where we need RDDs).
also, I am just trying to understand in general what about an RDD derived
from a DataFrame could slow things down
Can you please provide the high-level schema of the entities that you are
attempting to join? I think you may be able to use a more efficient
technique to join these together; perhaps by registering the DataFrames as
temp tables and constructing a Spark SQL query.
Also, which version of Spark are you using?
we have a join of 2 RDDs that was fast (< 1 min), and suddenly it
wouldn't even finish overnight anymore. The change was that one RDD was now
derived from a DataFrame.
so the new code that runs forever is something like this:
dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
any idea why?
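For what it's worth, two things commonly hurt in this pattern (both are assumptions about your setup, not something confirmed by the snippet): `dataframe.rdd` deserializes every row on each pass over the data, and wrapping the key in `Row(row(0))` makes the shuffle hash and compare generic Row objects rather than plain strings. A minimal sketch that keys by the raw string value and persists the converted RDD so the DataFrame-to-RDD conversion is not recomputed:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.storage.StorageLevel

// Sketch only -- assumes a DataFrame with two string columns, as described
// above, where the first column is the join key.
def keyByFirstColumn(df: DataFrame): RDD[(String, Row)] = {
  // key by the plain String, not a Row wrapper: cheaper hashCode/equals
  val keyed = df.rdd.map(row => (row.getString(0), row))
  // persist so the DataFrame-to-RDD conversion runs once, not per stage
  keyed.persist(StorageLevel.MEMORY_AND_DISK)
}
```

The resulting `RDD[(String, Row)]` can then be joined against another string-keyed RDD as before; whether this recovers the original performance depends on where the time was actually going, so checking the Spark UI stage timings before and after is worth it.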