Re: confusion about Spark SQL json format

2016-03-31 Thread Ross.Cramblit
You are correct that it does not take the standard JSON file format. From the Spark Docs: "Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will
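
For illustration, a minimal sketch of the newline-delimited layout the JSON reader expects (this assumes the pyspark shell, where sc and sqlContext already exist; the field names are illustrative, not from the thread):

    # Each element is one complete, self-contained JSON object on a single line,
    # mirroring the on-disk format Spark's json reader expects.
    lines = [
        '{"id": 1, "name": "a"}',
        '{"id": 2, "name": "b"}',
    ]
    df = sqlContext.read.json(sc.parallelize(lines))
    df.show()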

Dropping nested dataframe column

2016-03-10 Thread Ross.Cramblit
Is there any support for dropping a nested column in a dataframe? I have tried dropping with the Column reference as well as a string of the column name, but the returned dataframe is unchanged. >>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a", "col2": "b"}}']))
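
The thread does not show a resolution; one common workaround (an assumption on my part, not from the thread) is to rebuild the nested struct with only the fields to keep, since drop() in this version does not reach into structs. A sketch, assuming the pyspark shell:

    from pyspark.sql import functions as F

    df = sqlContext.jsonRDD(sc.parallelize(
        ['{"properties": {"col1": "a", "col2": "b"}}']))

    # Recreate 'properties' containing only col1, effectively dropping col2.
    pruned = df.select(
        F.struct(F.col("properties.col1").alias("col1")).alias("properties"))
    pruned.printSchema()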

Re: PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
Meant to include: I have this function which seems to work, but I am not sure if it is always correct: def octet_length(s): return len(s.encode('utf8')) sqlContext.registerFunction('octet_length', lambda x: octet_length(x)) > On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News)
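
A cleaned-up sketch of the same UDF (it assumes Python 2, matching the thread; the explicit return type keeps the result an integer rather than the default StringType, and None input is passed through):

    from pyspark.sql.types import IntegerType

    def octet_length(s):
        # number of bytes in the UTF-8 encoding of the string
        return len(s.encode('utf8')) if s is not None else None

    sqlContext.registerFunction('octet_length', octet_length, IntegerType())
    # e.g. sqlContext.sql("SELECT octet_length(name) FROM people")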

PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
I am trying to define a UDF to calculate octet_length of a string but I am having some trouble getting it right. Does anyone have a working version of this already/any pointers? I am using Spark 1.5.2/Python 2.7. Thanks

Spark-avro issue in 1.5.2

2016-02-24 Thread Ross.Cramblit
I’m trying to save a data frame in Avro format but am getting the following error: java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter; I found the following workaround
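
For context, a sketch of the kind of write that hits this error (the output path and package coordinates below are illustrative, not taken from the thread):

    # Launched with something like:
    #   spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 app.py
    # The NoSuchMethodError is raised while writing the Avro output.
    df.write.format("com.databricks.spark.avro").save("/tmp/output_avro")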

Re: troubleshooting "Missing an output location for shuffle"

2015-12-14 Thread Ross.Cramblit
Hey Veljko, I ran into this error a few days ago and it turned out I was out of disk space on a couple of nodes. I am not sure if this was the direct cause of the error, but it stopped being thrown once I cleared out some unneeded large files. On Dec 14, 2015, at 5:32 PM, Veljko Skarich

Re: Window function in Spark SQL

2015-12-11 Thread Ross.Cramblit
Hey Sourav, Window functions require using a HiveContext rather than the default SQLContext. See here: http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext HiveContext provides all the same functionality as SQLContext, as well as extra features like Window
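
A minimal sketch of the idea (the column names and input path are illustrative; it assumes an existing SparkContext sc): create a HiveContext and run a window function through the DataFrame API.

    from pyspark.sql import HiveContext, Window
    from pyspark.sql import functions as F

    hiveContext = HiveContext(sc)
    df = hiveContext.read.json("events.json")   # illustrative input

    # lead() over a per-user window ordered by time
    w = Window.partitionBy("user_id").orderBy("event_time")
    df.select("user_id", "event_time",
              F.lead("event_time").over(w).alias("next_event_time")).show()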

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay maybe these errors are more helpful - WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNINGs or ERRORs - the executors just seem to stop logging. I am running Spark 1.5.2 on YARN. On Dec 7, 2015, at 1:20 PM, Ted Yu wrote: bq. complete a shuffle stage due to lost executors Have

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line: [Stage 4:> (60 + 60) / 200]15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822 15/12/07 18:59:40 WARN

Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files and it turns out there are a number of duplicate JSON objects in the source data. I am trying to find the best way to remove these duplicates before using the dataframe. With both df.dropDuplicates() and
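
A sketch of the two usual approaches (the subset column name is an assumption, not from the thread): full-row dedup, or dedup keyed on the columns that identify a record.

    deduped_all = df.dropDuplicates()               # compares entire rows
    deduped_key = df.dropDuplicates(['event_id'])   # keeps one row per event_id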

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the logs: WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 25.4 GB of 25.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. On Nov 19, 2015, at 12:20 PM,
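
One way to apply that suggestion (the value here is illustrative): raise the off-heap headroom YARN allows each executor, either via SparkConf or as --conf spark.yarn.executor.memoryOverhead=3072 on spark-submit.

    from pyspark import SparkConf, SparkContext

    # Set before the SparkContext is created; value is in MB.
    conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "3072")
    sc = SparkContext(conf=conf)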

Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm, I guess I do not - I get 'application_1445957755572_0176 does not have any log files.' Where can I enable log aggregation? On Nov 19, 2015, at 11:07 AM, Ted Yu wrote: Do you have YARN log aggregation enabled? You can try retrieving log for

PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL transforms on a JSON data set that I load into a data frame. The data set is not large (~100GB) and most stages execute without any issues. However, some more complex stages tend to lose executors/nodes regularly. What

Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
Hello Spark community - I am running a Spark SQL query to calculate the difference in time between consecutive events, using lag(event_time) over window - SELECT device_id, unix_time, event_id, unix_time - lag(unix_time) OVER (PARTITION BY device_id ORDER
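
A sketch of the full shape of such a query (the table name and alias are illustrative; the thread's own statement is cut off above), run through a HiveContext since this Spark version needs one for window functions:

    diffs = hiveContext.sql("""
        SELECT device_id,
               unix_time,
               event_id,
               unix_time - lag(unix_time) OVER (PARTITION BY device_id
                                                ORDER BY unix_time) AS time_since_prev
        FROM events
    """)
    diffs.show()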

Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
I am using Spark 1.5.0 on Yarn. On Nov 2, 2015, at 3:16 PM, Yin Huai wrote: Hi Ross, What version of Spark are you using? There were two issues that affected the results of window functions in the Spark 1.5 branch. Both issues have been fixed