Re: confusing about Spark SQL json format

2016-03-31 Thread Ross.Cramblit
You are correct that it does not take the standard JSON file format. From the 
Spark Docs:
"Note that the file that is offered as a json file is not a typical JSON file. 
Each line must contain a separate, self-contained valid JSON object. As a 
consequence, a regular multi-line JSON file will most often fail."

http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
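
If the input really is a single multi-line JSON document (or a top-level array), one workaround is to read whole files and re-emit one record per line before handing the data to Spark SQL. A minimal sketch, assuming each file fits in an executor's memory (the path is a placeholder):

import json

def to_records(text):
    # parse a whole file as one JSON document; flatten a top-level
    # array into one record per element
    parsed = json.loads(text)
    return parsed if isinstance(parsed, list) else [parsed]

whole = sc.wholeTextFiles('path/to/multiline_json/')   # (path, content) pairs
df = sqlContext.jsonRDD(
    whole.flatMap(lambda kv: to_records(kv[1])).map(json.dumps))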

On Mar 31, 2016, at 5:30 AM, charles li wrote:

Hi UMESH, have you tried to load that JSON file on your machine? I tried it before, and here are the screenshots:

<Screen Shot 2016-03-31 at 5.27.30 PM.png>
<Screen Shot 2016-03-31 at 5.27.39 PM.png>




On Thu, Mar 31, 2016 at 5:19 PM, UMESH CHAUDHARY wrote:
Hi Charles,
The definition of an object from www.json.org:

An object is an unordered set of name/value pairs. An object begins with { 
(left brace) and ends with } (right brace). Each name is followed by : (colon) 
and the name/value pairs are separated by , (comma).

It's pretty much the OOP paradigm, isn't it?

Regards,
Umesh

On Thu, Mar 31, 2016 at 2:34 PM, charles li wrote:
Hi UMESH, I think you've misunderstood the JSON definition.

There is only one top-level object in a JSON file:


For the file people.json below:



{"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}

---

it has two valid formats:

1.



[ {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}},
{"name":"Michael", "address":{"city":null, "state":"California"}}
]

---

2.



{"name": ["Yin", "Michael"],
"address":[ {"city":"Columbus","state":"Ohio"},
{"city":null, "state":"California"} ]
}
---
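
A quick way to see the distinction, as a minimal check with Python's json module (the strings are abbreviated versions of the records above):

import json

two_lines = '{"name":"Yin"}\n{"name":"Michael"}'
as_array  = '[{"name":"Yin"}, {"name":"Michael"}]'

json.loads(as_array)    # parses: a single top-level array is one valid document
json.loads(two_lines)   # raises ValueError: two documents in one string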



On Thu, Mar 31, 2016 at 4:53 PM, UMESH CHAUDHARY wrote:
Hi,
Look at the image below, which is from json.org:



[image from json.org omitted: the JSON object syntax diagram]

The image describes the object structure of the JSON below:

Object 1 => {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
Object 2 => {"name":"Michael", "address":{"city":null, "state":"California"}}


Note that "address" is also an object.



On Thu, Mar 31, 2016 at 1:53 PM, charles li wrote:
As this post says, in Spark we can load a JSON file in the way below:

post: https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html


---
sqlContext.jsonFile(file_path)
or
sqlContext.read.json(file_path)
---


and the JSON file format looks like below, say people.json:

{"name":"Yin",
 "address":{"city":"Columbus","state":"Ohio"}}
{"name":"Michael", "address":{"city":null, "state":"California"}}
---


And here comes my problem:

Is that the standard JSON format? According to http://www.json.org/, I don't think so. It's just a collection of records [dicts], not a valid JSON document. as the json

Dropping nested dataframe column

2016-03-10 Thread Ross.Cramblit
Is there any support for dropping a nested column in a dataframe? I have tried 
dropping with the Column reference as well as a string of the column name, but 
the returned dataframe is unchanged.

>>> df = sqlContext.jsonRDD(sc.parallelize(['{"properties": {"col1": "a", "col2": "b"}}']))
>>> df.printSchema()
root
 |-- properties: struct (nullable = true)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)

>>> df.drop(df['properties']['col1']).printSchema()
root
 |-- properties: struct (nullable = true)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)

>>> df.drop('col1').printSchema()
root
 |-- properties: struct (nullable = true)
 |    |-- col1: string (nullable = true)
 |    |-- col2: string (nullable = true)

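One workaround is to rebuild the struct without the unwanted field and select it in place of the original column. A minimal sketch against the schema above, assuming the Spark 1.5-era pyspark.sql.functions API:

from pyspark.sql.functions import col, struct

# drop() cannot reach inside a struct, but the struct can be rebuilt
# without the unwanted field and selected in place of the original
df2 = df.select(struct(col('properties.col2').alias('col2')).alias('properties'))
df2.printSchema()   # properties now contains only col2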

Re: PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
Meant to include:

I have this function which seems to work, but I am not sure if it is always 
correct:

def octet_length(s):
    return len(s.encode('utf8'))

sqlContext.registerFunction('octet_length', lambda x: octet_length(x))

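One caveat, assuming NULL values can occur in the column: s.encode raises on None, so a guard is safer, and registering with an explicit IntegerType avoids the default StringType return type. A sketch:

from pyspark.sql.types import IntegerType

def octet_length(s):
    # NULL input stays NULL; otherwise count the UTF-8 encoded bytes
    return None if s is None else len(s.encode('utf8'))

sqlContext.registerFunction('octet_length', octet_length, IntegerType())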
> On Mar 8, 2016, at 12:30 PM, Cramblit, Ross (Reuters News) wrote:
> 
> I am trying to define a UDF to calculate octet_length of a string but I am 
> having some trouble getting it right. Does anyone have a working version of 
> this already/any pointers?
> 
> I am using Spark 1.5.2/Python 2.7.
> 
> Thanks
> 
> 





PySpark/SQL Octet Length

2016-03-08 Thread Ross.Cramblit
I am trying to define a UDF to calculate octet_length of a string but I am 
having some trouble getting it right. Does anyone have a working version of 
this already/any pointers?

I am using Spark 1.5.2/Python 2.7.

Thanks





Spark-avro issue in 1.5.2

2016-02-24 Thread Ross.Cramblit
I’m trying to save a data frame in Avro format but am getting the following 
error:

  java.lang.NoSuchMethodError: 
org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;

I found the following workaround thread, https://github.com/databricks/spark-avro/issues/91, which suggests this comes from a mismatch in Avro versions. I have tried both solutions it details, to no avail:
 - Manually downloading avro-1.7.7.jar and including it in 
/usr/lib/hadoop-mapreduce/
 - Adding avro-1.7.7.jar to spark.driver.extraClassPath and 
spark.executor.extraClassPath
 - The same with avro-1.6.6

I am still getting the same error, and now I am just stabbing in the dark. 
Anyone else still running into this issue?


I am using Pyspark 1.5.2 on EMR.
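
One more thing that may be worth trying, assuming an older Avro on the cluster classpath is shadowing avro-1.7.7: have Spark prefer user jars. The driver-side equivalent (spark.driver.userClassPathFirst) generally has to be set before the driver JVM starts, e.g. in spark-defaults.conf or on spark-submit; the executor-side setting can go through SparkConf. A sketch with a placeholder jar path:

from pyspark import SparkConf, SparkContext

# ask executors to prefer the user-supplied Avro over the cluster's;
# the jar path is a placeholder
conf = (SparkConf()
        .set('spark.jars', '/path/to/avro-1.7.7.jar')
        .set('spark.executor.userClassPathFirst', 'true'))
sc = SparkContext(conf=conf)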


Re: troubleshooting "Missing an output location for shuffle"

2015-12-14 Thread Ross.Cramblit
Hey Veljko,
I ran into this error a few days ago, and it turned out I was out of disk space on a couple of nodes. I am not sure that was the direct cause, but the error stopped once I cleared out some unneeded large files.


On Dec 14, 2015, at 5:32 PM, Veljko Skarich wrote:

Hi,

I keep getting some variation of the following error:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 2


Does anyone know what this might indicate? Is it a memory issue? Any general 
guidance appreciated.



Re: Window function in Spark SQL

2015-12-11 Thread Ross.Cramblit
Hey Sourav,
Window functions require using a HiveContext rather than the default 
SQLContext. See here: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sqlcontext

HiveContext provides all the functionality of SQLContext, plus extra features such as window functions.
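
For example, in PySpark (a minimal sketch; the appName and the simplified window spec are illustrative):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='window-demo')   # appName is illustrative
sqlContext = HiveContext(sc)               # window functions need HiveContext here

# the lead() query from the quoted message, with the frame clause simplified
lol5 = sqlContext.sql(
    "SELECT ky, lead(ky, 5, 0) OVER (ORDER BY ky) AS led FROM lolt")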

- Ross

On Dec 11, 2015, at 12:59 PM, Sourav Mazumder wrote:

Hi,

Spark SQL documentation says that it complies with Hive 1.2.1 APIs and supports 
Window functions. I'm using Spark 1.5.0.

However, when I try to execute something like below I get an error

val lol5 = sqlContext.sql("select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from lolt")

java.lang.RuntimeException: [1.32] failure: ``union'' expected but `(' found

select ky, lead(ky, 5, 0) over (order by ky rows 5 following) from lolt
                               ^
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
    at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
    at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
    at org.apache.spark.sql.SQLContext$$anonfun$3.apply(SQLContext.scala:169)
    at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:115)
    at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
    at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
    at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
    at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
    at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
    at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
    at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
    at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
    at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
    at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
    at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
    at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
    at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
    at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:166)
    at org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:42)
    at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:189)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:63)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:76)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:78)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:88)
    at $iwC$$iwC$$iwC.<init>(<console>:90)
    at $iwC$$iwC.<init>(<console>:92)
    at $iwC.<init>(<console>:94)
    at <init>(<console>:96)
    at .<init>(<console>:100)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)

Regards,
Sourav



Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Okay maybe these errors are more helpful -

WARN server.TransportChannelHandler: Exception in connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
15/12/07 21:25:42 ERROR client.TransportResponseHandler: Still have 5 requests outstanding when connection from ip-10-0-0-138.ec2.internal/10.0.0.138:39723 is closed
15/12/07 21:25:42 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)
15/12/07 21:25:42 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 39 outstanding blocks after 5000 ms
15/12/07 21:25:42 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
    at java.lang.Thread.run(Thread.java:745)

That continues for a while.

There is also this error on the Stage status from the Spark History server:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at 

Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have looked through the logs and do not see any WARNING or ERRORs - the 
executors just seem to stop logging.

I am running Spark 1.5.2 on YARN.

On Dec 7, 2015, at 1:20 PM, Ted Yu wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s) ?

Which release of Spark are you using ?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, Ross.Cramblit wrote:
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files and it
turns out there are a number of duplicate JSON objects in the source data. I am 
trying to find the best way to remove these duplicates before using the 
dataframe.

With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT *…''')
the application is not able to complete a shuffle stage due to lost executors. 
Is there a more efficient way to remove these duplicate rows? If not, what 
settings can I tweak to help this succeed? I have tried both increasing and 
decreasing the number of default shuffle partitions (to 100 and 500, 
respectively) but neither changes the behavior.





Re: Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
Here is the trace I get from the command line:
[Stage 4:> (60 + 60) / 200]
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 10.0.0.138:33822
15/12/07 18:59:40 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@10.0.0.138:33822] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-138.ec2.internal:54951] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/12/07 18:59:41 ERROR YarnScheduler: Lost executor 3 on ip-10-0-0-138.ec2.internal: remote Rpc client disassociated
15/12/07 18:59:41 WARN TaskSetManager: Lost task 62.0 in stage 4.0 (TID 2003, ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
15/12/07 18:59:41 WARN TaskSetManager: Lost task 65.0 in stage 4.0 (TID 2006, ip-10-0-0-138.ec2.internal): ExecutorLostFailure (executor 3 lost)
…
…



On Dec 7, 2015, at 1:33 PM, Cramblit, Ross (Reuters News) wrote:

I have looked through the logs and do not see any WARNING or ERRORs - the 
executors just seem to stop logging.

I am running Spark 1.5.2 on YARN.

On Dec 7, 2015, at 1:20 PM, Ted Yu wrote:

bq. complete a shuffle stage due to lost executors

Have you taken a look at the log for the lost executor(s) ?

Which release of Spark are you using ?

Cheers

On Mon, Dec 7, 2015 at 10:12 AM, Ross.Cramblit wrote:
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files and it
turns out there are a number of duplicate JSON objects in the source data. I am 
trying to find the best way to remove these duplicates before using the 
dataframe.

With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT *…''')
the application is not able to complete a shuffle stage due to lost executors. 
Is there a more efficient way to remove these duplicate rows? If not, what 
settings can I tweak to help this succeed? I have tried both increasing and 
decreasing the number of default shuffle partitions (to 100 and 500, 
respectively) but neither changes the behavior.






Removing duplicates from dataframe

2015-12-07 Thread Ross.Cramblit
I have a pyspark app loading a large-ish (100GB) dataframe from JSON files and it
turns out there are a number of duplicate JSON objects in the source data. I am 
trying to find the best way to remove these duplicates before using the 
dataframe.

With both df.dropDuplicates() and df.sqlContext.sql('''SELECT DISTINCT *…''')
the application is not able to complete a shuffle stage due to lost executors. 
Is there a more efficient way to remove these duplicate rows? If not, what 
settings can I tweak to help this succeed? I have tried both increasing and 
decreasing the number of default shuffle partitions (to 100 and 500, 
respectively) but neither changes the behavior.
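
Two things that sometimes ease shuffle pressure for a dedup like this, sketched with illustrative (untuned) values; 'event_id' is a hypothetical unique key, not a column from this data set:

sqlContext.setConf('spark.sql.shuffle.partitions', '2000')  # well above the 500 tried
deduped = df.dropDuplicates()
# if the records carry a unique key, deduplicating on just that
# subset shuffles far less data:
deduped_by_key = df.dropDuplicates(['event_id'])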



Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Thank you Ted and Sandy for getting me pointed in the right direction. From the 
logs:

WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 25.4 GB of 25.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
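
For reference, the overhead can be raised where the context is created. A sketch; the value is illustrative, not a tuned recommendation:

from pyspark import SparkConf, SparkContext

# raise the off-heap headroom YARN allows per executor; 4096 MB is an
# illustrative value, not a recommendation for this workload
conf = SparkConf().set('spark.yarn.executor.memoryOverhead', '4096')
sc = SparkContext(conf=conf)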


On Nov 19, 2015, at 12:20 PM, Ted Yu wrote:

Here are the parameters related to log aggregation:

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>2592000</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.debug-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.num-log-files-per-app</name>
  <value>30</value>
</property>

<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>-1</value>
</property>


On Thu, Nov 19, 2015 at 8:14 AM, Ross.Cramblit wrote:
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any log files.' Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu wrote:

Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, Ross.Cramblit wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 200]
15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.






Re: PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
Hmm I guess I do not - I get 'application_1445957755572_0176 does not have any log files.' Where can I enable log aggregation?
On Nov 19, 2015, at 11:07 AM, Ted Yu wrote:

Do you have YARN log aggregation enabled ?

You can try retrieving log for the container using the following command:

yarn logs -applicationId application_1445957755572_0176 -containerId container_1445957755572_0176_01_03

Cheers

On Thu, Nov 19, 2015 at 8:02 AM, Ross.Cramblit wrote:
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information -

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 200]
15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.




PySpark Lost Executors

2015-11-19 Thread Ross.Cramblit
I am running Spark 1.5.2 on Yarn. My job consists of a number of SparkSQL 
transforms on a JSON data set that I load into a data frame. The data set is 
not large (~100GB) and most stages execute without any issues. However, some 
more complex stages tend to lose executors/nodes regularly. What would cause 
this to happen? The logs don’t give too much information - 

15/11/19 15:53:43 ERROR YarnScheduler: Lost executor 2 on ip-10-0-0-136.ec2.internal: Yarn deallocated the executor 2 (container container_1445957755572_0176_01_03)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 142.0 in stage 33.0 (TID 8331, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 133.0 in stage 33.0 (TID 8322, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 79.0 in stage 33.0 (TID 8268, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 141.0 in stage 33.0 (TID 8330, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 123.0 in stage 33.0 (TID 8312, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 162.0 in stage 33.0 (TID 8351, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 153.0 in stage 33.0 (TID 8342, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 120.0 in stage 33.0 (TID 8309, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 149.0 in stage 33.0 (TID 8338, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
15/11/19 15:53:43 WARN TaskSetManager: Lost task 134.0 in stage 33.0 (TID 8323, ip-10-0-0-136.ec2.internal): ExecutorLostFailure (executor 2 lost)
[Stage 33:===> (117 + 50) / 200]
15/11/19 15:53:46 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@ip-10-0-0-136.ec2.internal:60275] has failed, address is now gated for [5000] ms. Reason: [Disassociated]

 - Followed by a list of lost tasks on each executor.

Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
Hello Spark community -
I am running a Spark SQL query to calculate the difference in time between 
consecutive events, using lag(event_time) over window -


SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS seconds_since_last_event
FROM ios_d_events;

This is giving me some strange results in the case where the first two events 
for a particular device_id have the same timestamp.
I used the following query to take a look at what value was being returned by lag():

SELECT device_id,
       event_time,
       unix_time,
       event_id,
       lag(event_time) OVER (PARTITION BY device_id ORDER BY unix_time, event_id) AS lag_time
FROM ios_d_events;

I’m seeing that in these cases I get something like 1970-01-03 … instead of a null value, and the subsequent lag times all follow the same format.

I posted a section of this output in this SO question: 
http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls

The errant results are labeled with device_id 999.

Any idea why this is occurring?

- Ross


Re: Spark SQL lag() window function, strange behavior

2015-11-02 Thread Ross.Cramblit
I am using Spark 1.5.0 on Yarn

On Nov 2, 2015, at 3:16 PM, Yin Huai wrote:

Hi Ross,

What version of Spark are you using? There were two issues that affected the results of window functions in the Spark 1.5 branch. Both issues have been fixed and will be released with Spark 1.5.2 (this release will happen soon). For more details on these two issues, you can take a look at https://issues.apache.org/jira/browse/SPARK-11135 and https://issues.apache.org/jira/browse/SPARK-11009.

Thanks,

Yin

On Mon, Nov 2, 2015 at 12:07 PM, Ross.Cramblit wrote:
Hello Spark community -
I am running a Spark SQL query to calculate the difference in time between 
consecutive events, using lag(event_time) over window -


SELECT device_id,
       unix_time,
       event_id,
       unix_time - lag(unix_time)
           OVER (PARTITION BY device_id ORDER BY unix_time, event_id)
           AS seconds_since_last_event
FROM ios_d_events;

This is giving me some strange results in the case where the first two events 
for a particular device_id have the same timestamp.
I used the following query to take a look at what value was being returned by lag():

SELECT device_id,
       event_time,
       unix_time,
       event_id,
       lag(event_time) OVER (PARTITION BY device_id ORDER BY unix_time, event_id) AS lag_time
FROM ios_d_events;

I’m seeing that in these cases I get something like 1970-01-03 … instead of a null value, and the subsequent lag times all follow the same format.

I posted a section of this output in this SO question: 
http://stackoverflow.com/questions/33482167/spark-sql-window-function-lag-giving-unexpected-resutls

The errant results are labeled with device_id 999.

Any idea why this is occurring?

- Ross