You can use analytic (window) functions in Spark SQL. Something like:

select * from (
  select *, row_number() over (partition by id order by timestamp) as rn
  from root
  where isValid
) where rn = 1

Note the inner query needs select * (not just id) so the outer query can
return the full record, and the isValid filter keeps the ranking from
picking an invalid row as rn = 1.
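Untested, but here is roughly how that could be wired up from PySpark.
The DataFrame variable record and the view name root are placeholders,
not names from the original post:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# expose the DataFrame to Spark SQL under a temp view name
record.createOrReplaceTempView('root')

earliest = spark.sql('''
    select * from (
        select *, row_number() over (partition by id order by timestamp) as rn
        from root
        where isValid
    ) where rn = 1
''').drop('rn')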
On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal wrote:
> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean, other metrics) ...
Untested, but something like the below should work:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

(record
 # keep only valid records before ranking
 .filter(F.col('isValid'))
 .withColumn('ts_rank',
             F.dense_rank().over(Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank')
)
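One caveat: dense_rank keeps every row that ties for the earliest
timestamp within an id, so if duplicate timestamps are possible and you
want exactly one row per id, F.row_number() is the safer choice. As a
rough alternative without a window function, a min-over-struct sketch
(again assuming the DataFrame is named record):

from pyspark.sql import functions as F

earliest = (record
            .filter(F.col('isValid'))
            .groupBy('id')
            # min of a struct compares its fields left to right, so putting
            # timestamp first selects the earliest valid record per id
            .agg(F.min(F.struct(*[c for c in record.columns if c != 'id']))
                 .alias('earliest'))
            .select('id', 'earliest.*'))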
On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal wrote:
Hi guys,
I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
Boolean, other metrics)
Schema looks like this:
root
|-- id: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- isValid: boolean (nullable = true)
...
I need to find the earliest valid record for each id.
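For anyone who wants to try the suggestions above, a minimal sketch of a
DataFrame matching this schema (the values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

record = spark.createDataFrame(
    [(1, 100, False),   # invalid, should be skipped
     (1, 101, True),    # earliest valid record for id 1
     (1, 102, True),
     (2, 50, True)],    # earliest (and only) valid record for id 2
    'id long, timestamp long, isValid boolean')
record.printSchema()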