Re: java.lang.OutOfMemoryError while running Shark on Mesos

2014-05-23 Thread Akhil Das
Hi Prabeesh,

Do an export _JAVA_OPTIONS=-Xmx10g before starting Shark. You can also do a
ps aux | grep shark to see how much memory the process has been allocated;
most likely it is only 512 MB, in which case increase the limit.
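
A minimal sketch of both steps, assuming the Shark launch script picks up
_JAVA_OPTIONS the way most JVM launchers do (10g is only an example; size it
to your data):

  # give the Shark driver JVM a bigger heap before launching it
  export _JAVA_OPTIONS=-Xmx10g
  bin/shark    # or however you normally start Shark

  # in another shell: check the heap the running process actually got (look for -Xmx)
  ps aux | grep shark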

Thanks
Best Regards


On Fri, May 23, 2014 at 10:22 AM, prabeesh k prabsma...@gmail.com wrote:


 Hi,

 I am trying to apply an inner join in Shark using a 64 MB and a 27 MB file. I am
 able to run the following queries on Mesos:


- SELECT * FROM geoLocation1

- SELECT * FROM geoLocation1 WHERE country = 'US'


 But while trying an inner join such as

  SELECT * FROM geoLocation1 g1 INNER JOIN geoBlocks1 g2 ON (g1.locId =
 g2.locId)



 I am getting the following error:


 Exception in thread main org.apache.spark.SparkException: Job aborted:
 Task 1.0:7 failed 4 times (most recent failure: Exception failure:
 java.lang.OutOfMemoryError: Java heap space)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
  at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
  at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
  at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Please help me to resolve this.

 Thanks in advance.

 regards,
 prabeesh



No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Jim Donahue
I'm trying out 1.0 on a set of small Spark Streaming tests and am running
into problems. Here's one of the little programs I've used for a long
time - it reads a Kafka stream that contains Twitter JSON tweets and does
some simple counting. The program starts OK (it connects to the Kafka
stream fine) and generates a stream of INFO logging messages, but never
generates any output. :-(

I'm running this in Eclipse, so there may be some class loading issue
(loading the wrong class or something like that), but I'm not seeing
anything in the console output.

Thanks,

Jim Donahue
Adobe



val kafka_messages =
  KafkaUtils.createStream[Array[Byte], Array[Byte],
    kafka.serializer.DefaultDecoder, kafka.serializer.DefaultDecoder](ssc,
    propsMap, topicMap, StorageLevel.MEMORY_AND_DISK)

val messages = kafka_messages.map(_._2)

val total = ssc.sparkContext.accumulator(0)

val startTime = new java.util.Date().getTime()

val jsonstream = messages.map[JSONObject](message => {
  val string = new String(message)
  val json = new JSONObject(string)
  total += 1
  json
})

val deleted = ssc.sparkContext.accumulator(0)

val msgstream = jsonstream.filter(json =>
  if (!json.has("delete")) true else { deleted += 1; false }
)

msgstream.foreach(rdd => {
  if (rdd.count() > 0) {
    val data = rdd.map(json => (json.has("entities"), json.length())).collect()
    val entities: Double = data.count(t => t._1)
    val fieldCounts = data.sortBy(_._2)
    val minFields = fieldCounts(0)._2
    val maxFields = fieldCounts(fieldCounts.size - 1)._2
    val now = new java.util.Date()
    val interval = (now.getTime() - startTime) / 1000
    System.out.println(now.toString)
    System.out.println("processing time: " + interval + " seconds")
    System.out.println("total messages: " + total.value)
    System.out.println("deleted messages: " + deleted.value)
    System.out.println("message receipt rate: " + (total.value / interval) + " per second")
    System.out.println("messages this interval: " + data.length)
    System.out.println("message fields varied between: " + minFields + " and " + maxFields)
    System.out.println("fraction with entities is " + (entities / data.length))
  }
})

ssc.start()



Re: No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Patrick Wendell
Also, one other thing to try: remove all of the logic from inside the
foreach and just print something. It could be that an exception is somehow
being triggered inside your foreach block and as a result the output goes
away.
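
For example, a minimal sketch of that check, reusing the msgstream val from
the program in the original post (just a bare per-batch print, with all the
other logic stripped out):

msgstream.foreach(rdd => {
  // if even this never shows up, the problem is not the per-batch logic
  println("batch at " + new java.util.Date() + ": " + rdd.count() + " records")
})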

On Fri, May 23, 2014 at 6:00 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Jim,

 Do you see the same behavior if you run this outside of Eclipse?

 Also, what happens if you print something to standard out when setting
 up your streams (i.e. not inside of the foreach)? Do you see that? This
 could be a streaming issue, but it could also be something related to
 the way it's running in Eclipse (a one-line example follows below).

 - Patrick
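
The setup-time print Patrick mentions could be as simple as a hypothetical
line dropped in right after the DStreams are defined, before ssc.start():

// if even this line never reaches the console, stdout itself isn't visible in Eclipse
println("streams wired up at " + new java.util.Date())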




Re: No output from Spark Streaming program with Spark 1.0

2014-05-23 Thread Tathagata Das
A few more suggestions.
1. Check the web UI: is the system running any jobs? If not, you may need
to give the system more nodes; basically the system should have more
cores than the number of receivers (see the sketch below).
2. Furthermore, there is a streaming-specific web UI which gives more
streaming-specific data.
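
For example, with a single Kafka receiver the app needs at least two cores:
one for the receiver and one to process the batches. A minimal sketch of such
a setup (local mode, placeholder app name and batch interval, shown only to
illustrate the core count):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2] = 2 cores: 1 for the single Kafka receiver, 1 left over to process batches.
// With local[1] the receiver occupies the only core and no output is ever produced.
val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaTweetCounts")
val ssc = new StreamingContext(conf, Seconds(10))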

