You are correct that it does not take the standard JSON file format. From the
Spark Docs:
"Note that the file that is offered as a json file is not a typical JSON file.
Each line must contain a separate, self-contained valid JSON object. As a
consequence, a regular multi-line JSON file will
Is there any support for dropping a nested column in a dataframe? I have tried
dropping with the Column reference as well as a string of the column name, but
the returned dataframe is unchanged.
>>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a",
>>> "col2": "b"}}']))
>>>
Meant to include:
I have this function which seems to work, but I am not sure if it is always
correct:
def octet_length(s):
return len(s.encode(‘utf8’))
sqlContext.registerFunction('octet_length', lambda x: octet_length(x))
> On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News)
>
I am trying to define a UDF to calculate octet_length of a string but I am
having some trouble getting it right. Does anyone have a working version of
this already/any pointers?
I am using Spark 1.5.2/Python 2.7.
Thanks
-
To
I’m trying to save a data frame in Avro format but am getting the following
error:
java.lang.NoSuchMethodError:
org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
I found the following workaround
Hey Velijko,
I ran into this error a few days ago and it turned out I was out of disk space
on a couple nodes. I am not sure if this was the direct cause of the error, but
it stopped throwing when I cleared out some unneeded large files.
On Dec 14, 2015, at 5:32 PM, Veljko Skarich
Hey Sourav,
Window functions require using a HiveContext rather than the default
SQLContext. See here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext
HiveContext provides all the same functionality of SQLContext, as well as extra
features like Window
Okay maybe these errors are more helpful -
WARN server.TransportChannelHandler: Exception in connection from
ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at
I have looked through the logs and do not see any WARNING or ERRORs - the
executors just seem to stop logging.
I am running Spark 1.5.2 on YARN.
On Dec 7, 2015, at 1:20 PM, Ted Yu
> wrote:
bq. complete a shuffle stage due to lost executors
Have
Here is the trace I get from the command line:
[Stage 4:> (60 + 60) /
200]15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN
I have pyspark app loading a large-ish (100GB) dataframe from JSON files and it
turns out there are a number of duplicate JSON objects in the source data. I am
trying to find the best way to remove these duplicates before using the
dataframe.
With both df.dropDuplicates() and
Thank you Ted and Sandy for getting me pointed in the right direction. From the
logs:
WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits.
25.4 GB of 25.3 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
On Nov 19, 2015, at 12:20 PM,
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any
log files.’ Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu
> wrote:
Do you have YARN log aggregation enabled ?
You can try retrieving log for
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL
transforms on a JSON data set that I load into a data frame. The data set is
not large (~100GB) and most stages execute without any issues. However, some
more complex stages tend to lose executors/nodes regularly. What
Hello Spark community -
I am running a Spark SQL query to calculate the difference in time between
consecutive events, using lag(event_time) over window -
SELECT device_id,
unix_time,
event_id,
unix_time - lag(unix_time)
OVER
(PARTITION BY device_id ORDER
I am using Spark 1.5.0 on Yarn
On Nov 2, 2015, at 3:16 PM, Yin Huai
> wrote:
Hi Ross,
What version of spark are you using? There were two issues that affected the
results of window function in Spark 1.5 branch. Both of issues have been fixed
16 matches
Mail list logo