Hi All,

Just wanted to check back regarding the best way to perform deduplication. Is
using dropDuplicates the optimal way to get rid of duplicates? Or would it be
better to run operations on the RDD directly?
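
For example, deduplicating on the RDD instead, something like the following
(a rough sketch; the column names are just for illustration):

# key each row by the columns to deduplicate on, then keep one row per key
deduped_rdd = (input_df.rdd
               .keyBy(lambda row: (row['col1'], row['col2']))
               .reduceByKey(lambda a, b: a)
               .values())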

Also, what if we want to keep the last value in each group while performing
deduplication (based on some sorting criteria)?
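
Right now I am thinking of a window function for that, something like the
sketch below (assuming a ts column to sort by; the column names are
hypothetical):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rank the rows within each (col1, col2) group by ts descending,
# then keep only the top-ranked (i.e. most recent) row per group
w = Window.partitionBy('col1', 'col2').orderBy(F.col('ts').desc())
deduped = (input_df
           .withColumn('rn', F.row_number().over(w))
           .filter(F.col('rn') == 1)
           .drop('rn'))

Is that the right approach, or is there something cheaper?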

Thanks,
Rishi

On Mon, May 20, 2019 at 3:33 PM Nicholas Hakobian <
nicholas.hakob...@rallyhealth.com> wrote:

> From doing some searching around in the Spark codebase, I found the
> following:
>
>
> https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
>
> So it appears there is no dedicated physical operation for dropDuplicates
> or Deduplicate; instead, an optimizer rule (ReplaceDeduplicateWithAggregate)
> rewrites the logical Deduplicate operation into an Aggregate that groups by
> the columns you want to deduplicate across (or all columns, if you are
> doing something like distinct) and takes the First() value of each
> remaining column. So (using a PySpark code example):
>
> df = input_df.dropDuplicates(['col1', 'col2'])
>
> Is effectively shorthand for saying something like:
>
> from pyspark.sql.functions import first, struct
>
> df = (input_df
>       .groupBy('col1', 'col2')
>       .agg(first(struct(input_df.columns)).alias('data'))
>       .select('data.*'))
>
> Except that I assume it has some internal optimization so it doesn't need
> to pack/unpack the column data and can just return the whole Row.
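>
> You can see the rewrite by looking at the physical plan (the exact plan
> text varies by Spark version), e.g.:
>
> input_df.dropDuplicates(['col1', 'col2']).explain()
> # prints a plan containing HashAggregate(keys=[col1, col2],
> # functions=[first(...)]) rather than any dedicated deduplication operator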
>
> Nicholas Szandor Hakobian, Ph.D.
> Principal Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com
>
>
>
> On Mon, May 20, 2019 at 11:38 AM Yeikel <em...@yeikel.com> wrote:
>
>> Hi,
>>
>> I am looking for a high-level explanation (overview) of how
>> dropDuplicates [1] works.
>>
>> [1]
>>
>> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>>
>> Could someone please explain?
>>
>> Thank you
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>

-- 
Regards,

Rishi Shah
