How to deal with context dependent computing?

2018-08-22 Thread JF Chen
For example, I have some data with a timestamp, marked as category A or B and ordered by time. Now I want to calculate each duration from A to B. In a normal program, I can use a flag bit to record whether the previous record is A or B, and then calculate the duration. But in a Spark DataFrame, how to do
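In Spark this "remember the previous row" pattern is usually expressed with a window function (e.g. `lag` over a time-ordered window) rather than a mutable flag. As a plain-Python sketch of the flag-based logic the question describes (the column names and the A-pairs-with-next-B rule are assumptions, not from the original post):

```python
from datetime import datetime

def durations_a_to_b(rows):
    """Given (timestamp, category) rows ordered by time, return the
    duration from each 'A' event to the next 'B' event."""
    durations = []
    last_a = None  # the "flag bit": remembers the most recent 'A' timestamp
    for ts, cat in rows:
        if cat == "A":
            last_a = ts
        elif cat == "B" and last_a is not None:
            durations.append(ts - last_a)
            last_a = None  # reset so each A pairs with at most one B
    return durations

rows = [
    (datetime(2018, 8, 22, 10, 0), "A"),
    (datetime(2018, 8, 22, 10, 5), "B"),
    (datetime(2018, 8, 22, 11, 0), "A"),
    (datetime(2018, 8, 22, 11, 30), "B"),
]
print(durations_a_to_b(rows))  # durations of 5 minutes and 30 minutes
```

In a DataFrame the same effect comes from `F.lag("ts").over(Window.orderBy("ts"))` and filtering rows where the previous category was "A" and the current is "B".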

About the question of Spark Structured Streaming window output

2018-08-22 Thread z...@zjdex.com
Hi: I have some questions about Spark Structured Streaming window output in Spark 2.3.1. I wrote the application code as follows: case class DataType(time: Timestamp, value: Long) {} val spark = SparkSession .builder .appName("StructuredNetworkWordCount")
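The quoted code is truncated, but the question concerns what Spark's `window()` grouping actually computes. As a rough illustration, a tumbling-window aggregation just buckets events by fixed time intervals and aggregates per bucket; in plain Python (the 10-minute width and the sum aggregation are assumptions for illustration):

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # assumed tumbling-window width
EPOCH = datetime(1970, 1, 1)

def tumbling_window_sum(events):
    """Sum `value` per 10-minute tumbling window, roughly what
    df.groupBy(window($"time", "10 minutes")).sum("value") computes."""
    buckets = defaultdict(int)
    for time, value in events:
        # floor the timestamp down to the start of its window
        start = EPOCH + ((time - EPOCH) // WINDOW) * WINDOW
        buckets[start] += value
    return dict(buckets)

events = [
    (datetime(2018, 8, 22, 9, 1), 3),
    (datetime(2018, 8, 22, 9, 8), 4),
    (datetime(2018, 8, 22, 9, 12), 5),
]
print(tumbling_window_sum(events))  # 9:00 window -> 7, 9:10 window -> 5
```

In streaming, which of these buckets is emitted and when depends on the output mode and watermark, which is where the window-output questions usually arise.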

Re: How to merge multiple rows

2018-08-22 Thread Patrick McCarthy
You didn't specify which API, but in pyspark you could do:

import pyspark.sql.functions as F
df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')).alias('DETAILS')).show()

+---+------------+
| ID|     DETAILS|
+---+------------+
|  1|[A1, A2, A3]|
|  3|        [B2]|
|  2|        [B1]|
+---+------------+
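For readers less familiar with these functions, the combination behaves like "collect the distinct DETAILS per ID, then sort each collected list". A plain-Python sketch of the same semantics (a rough model of `collect_set` + `sort_array`, not the Spark implementation):

```python
def merge_details(rows):
    """Group DETAILS by ID and sort each group, mirroring
    df.groupBy('ID').agg(F.sort_array(F.collect_set('DETAILS')))."""
    groups = {}
    for row_id, details in rows:
        groups.setdefault(row_id, set()).add(details)  # collect_set: dedupes
    return {row_id: sorted(vals) for row_id, vals in groups.items()}  # sort_array

rows = [(1, "A1"), (1, "A2"), (1, "A3"), (2, "B1"), (3, "B2")]
print(merge_details(rows))  # {1: ['A1', 'A2', 'A3'], 2: ['B1'], 3: ['B2']}
```

Note that `collect_set` drops duplicate DETAILS values within an ID; if duplicates must be kept, `collect_list` is the alternative.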

Re: How to merge multiple rows

2018-08-22 Thread Jean Georges Perrin
How do you do it now? You could use a withColumn(“newDetails”, ) jg

> On Aug 22, 2018, at 16:04, msbreuer wrote:
>
> A dataframe with the following contents is given:
>
> ID PART DETAILS
>  1    1      A1
>  1    2      A2
>  1    3      A3
>  2    1      B1
>  3    1      C1
>
> Target format should be as follows:
>

How to merge multiple rows

2018-08-22 Thread msbreuer
A dataframe with the following contents is given:

ID PART DETAILS
 1    1      A1
 1    2      A2
 1    3      A3
 2    1      B1
 3    1      C1

Target format should be as follows:

ID DETAILS
 1 A1+A2+A3
 2 B1
 3 C1

Note, the order of A1-A3 is important. Currently I am using this alternative:

ID DETAIL_1 DETAIL_2

Re: No space left on device

2018-08-22 Thread Gourav Sengupta
Hi, that was just one of the options, and not the first one; is there any chance of trying out the other options mentioned? For example, pointing the shuffle storage area to a location with more space? Regards, Gourav Sengupta On Wed, Aug 22, 2018 at 11:15 AM Vitaliy Pisarev <

Re: No space left on device

2018-08-22 Thread Vitaliy Pisarev
Documentation says that 'spark.shuffle.memoryFraction' was deprecated, but it doesn't say what to use instead. Any idea? On Wed, Aug 22, 2018 at 9:36 AM, Gourav Sengupta wrote: > Hi, > > The best part about Spark is that it is showing you which configuration to > tweak as well. In case you are
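For reference: since Spark 1.6 the separate shuffle and storage memory fractions were folded into the unified memory manager, so there is no direct one-to-one replacement. The closest modern knobs in spark-defaults.conf are (values shown are the defaults; tuning them is rarely needed):

```
# Unified memory management (Spark 1.6+): fraction of (heap - 300MB)
# shared by execution (shuffle) and storage.
spark.memory.fraction         0.6
# Portion of the above that storage may keep immune from eviction
# by execution.
spark.memory.storageFraction  0.5
```

For a "no space left on device" error specifically, the relevant setting is usually disk, not memory: `spark.local.dir` (where shuffle files spill), as discussed elsewhere in this thread.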

Failed to create file system watcher service: User limit of inotify instances reached or too many open files

2018-08-22 Thread Polisetti, Venkata Siva Rama Gopala Krishna
Hi, When I am doing calculations for example 700 listID's it is saving only some 50 rows and then getting some random exceptions Getting below exception when I try to do calculations on huge data and try to save huge data . Please let me know if any suggestions. Sample Code : I have some
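The inotify half of this error is an operating-system limit rather than a Spark one. A common remedy (the values below are illustrative, not prescriptive) is to raise the per-user inotify limits on the affected nodes, e.g. in a sysctl drop-in file:

```
# /etc/sysctl.d/90-inotify.conf -- raise per-user inotify limits
fs.inotify.max_user_instances = 1024
fs.inotify.max_user_watches   = 524288
```

Apply with `sudo sysctl --system`. The "too many open files" half may additionally require a higher `nofile` limit (e.g. in /etc/security/limits.conf) for the user running the executors.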

Re: No space left on device

2018-08-22 Thread Gourav Sengupta
Hi, the best part about Spark is that it shows you which configuration to tweak as well. If you are using EMR, check that "spark.local.dir" points to the right location in the cluster. If a disk is mounted across all the systems with a common path (you can do that
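A minimal sketch of this suggestion in spark-defaults.conf (the path is an assumption; use whichever large volume is mounted at the same path on every node):

```
# spark-defaults.conf
# Comma-separated scratch directories used for shuffle files and
# spilled data; point these at the largest local disks available.
spark.local.dir  /mnt/spark-scratch
```

One caveat: on YARN clusters (including EMR) this setting is typically overridden by the node manager's configured local directories, so the change may need to be made at the YARN/cluster level rather than in Spark alone.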