How do you load the data? How do you write it? I fear that without the full source code it will be difficult to troubleshoot the issue.
Which Spark version? The use case is not yet 100% clear to me. Do you want to set the row with the oldest/newest date to true? I would just use top or something similar when processing the data.

> On 4. Jun 2018, at 17:33, Jain, Neha T. <neha.t.j...@accenture.com> wrote:
>
> Hi Jorn,
>
> I tried removing userid from my sort clause, but the issue is still the same: the data is not sorted.
>
> var newDf = data.repartition(col(userid)).sortWithinPartitions(sid,time)
>
> I am checking the sorting results by temporarily writing this file to Hive as well as HDFS. When I look at the user-wise data, it is not sorted.
> Attaching the output file for your reference.
>
> On the basis of sorting within userid partitions, I want to add a flag which marks the first item in the partition as true and the other items in that partition as false.
> If my sorting order is disturbed, the flag is wrongly set.
>
> Please suggest what else could be done to fix this very basic scenario of sorting in Spark across multiple partitions across multiple nodes.
>
> Thanks & Regards,
> Neha Jain
>
> From: Jörn Franke [mailto:jornfra...@gmail.com]
> Sent: Monday, June 4, 2018 10:48 AM
> To: Sing, Jasbir <jasbir.s...@accenture.com>
> Cc: user@spark.apache.org; Patel, Payal <payal.pa...@accenture.com>; Jain, Neha T. <neha.t.j...@accenture.com>
> Subject: [External] Re: Sorting in Spark on multiple partitions
>
> You partition by userid, why do you then sort again by userid in the partition? Can you try to remove userid from the sort?
>
> How do you check if the sort is correct or not?
>
> What is the underlying objective of the sort? Do you have more information on schema and data?
>
> On 4. Jun 2018, at 05:47, Sing, Jasbir <jasbir.s...@accenture.com> wrote:
>
> Hi Team,
>
> We are currently using Spark 2.2.0 and facing some challenges in sorting data across multiple partitions.
> We have tried the approaches below:
>
> Spark SQL approach:
> a. var query = "select * from data distribute by " + userid + " sort by " + userid + ", " + time
>
> This query returns correct results in Hive but not in Spark SQL.
>
> var newDf = data.repartition(col(userid)).orderBy(userid, time)
> var newDf = data.repartition(col(userid)).sortWithinPartitions(userid,time)
>
> But none of the above approaches gives correct results for sorting the data.
> Please suggest what could be done for the same.
>
> Thanks & Regards,
> Neha Jain
>
> <test.csv>
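
For the goal described in the thread (flag the first row per userid, ordered by time), one possible alternative to relying on the physical row order is a window function. This is only a minimal sketch and not the code used in the thread: the column names userid, sid and time come from the messages above, while the sample rows, the isFirst column name and the local SparkSession are assumptions made for illustration. It is written as spark-shell or script-style statements.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for illustration only (assumption); a real job would reuse its own session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("first-row-flag")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; the real schema is not shown in the thread.
val data = Seq(
  ("u1", "s1", 30L),
  ("u1", "s1", 10L),
  ("u2", "s2", 20L),
  ("u1", "s2", 20L),
  ("u2", "s2", 5L)
).toDF("userid", "sid", "time")

// The ordering is part of the window definition, so the flag does not depend
// on how the DataFrame happens to be partitioned, sorted or written out.
val byUser = Window.partitionBy(col("userid")).orderBy(col("time"))

// Number the rows per userid by ascending time and flag the row numbered 1.
val flagged = data
  .withColumn("rn", row_number().over(byUser))
  .withColumn("isFirst", col("rn") === 1)
  .drop("rn")

// Sorted here only to make the displayed output easy to read.
flagged.orderBy("userid", "time").show()
```

If several rows of one user can share the same time, row_number picks one of them arbitrarily; adding a further column such as sid to the window's orderBy would make the choice deterministic.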