Re: Error Compiling

2015-01-30 Thread Akhil Das
This is how I do it:

val tmp = test.map(x => (x, 1L)).reduceByWindow({ case ((word1, count1),
(word2, count2)) => (word1 + " " + word2, count1 + count2)}, Seconds(10),
Seconds(10))


In your case, you actually have a type mismatch:

[image: Inline image 1]
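
For reference, a minimal sketch of one way around it (not from the original mail): assuming the goal is the windowed word count from the stock KafkaWordCount example and the imports from that file, reduceByKeyAndWindow reduces the Long counts per key, so the parameter types are unambiguous:

// Sketch only: assumes `words` is the DStream[String] from KafkaWordCount and that
// ssc.checkpoint(...) has been set, which the inverse-reduce form requires.
val wordCounts = words.map(x => (x, 1L))
  .reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,   // counts entering the window
    (a: Long, b: Long) => a - b,   // counts leaving the window
    Minutes(1), Seconds(2), 2)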



Thanks
Best Regards

On Sat, Jan 31, 2015 at 5:30 AM, Eduardo Costa Alfaia <
e.costaalf...@unibs.it> wrote:

> Hi Guys,
>
> Any idea how to solve this error?
>
> [error]
> /sata_disk/workspace/spark-1.1.1/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:76:
> missing parameter type for expanded function ((x$6, x$7) => x$6.$plus(x$7))
>
> [error] val wordCounts = words.map(x => (x, 1L)).reduceByWindow(_ +
> _, _ - _, Minutes(1), Seconds(2), 2)
>
> Thanks
>
> Privacy Notice: http://www.unibs.it/node/8155


Re: Long pauses after writing to sequence files

2015-01-30 Thread Akhil Das
Not quite sure, but it could be a GC pause if you are holding too many
objects in memory. You can check the tuning section of the docs if you haven't
already been through it.
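
A minimal sketch (not from the original mail) of one way to confirm the GC theory: turn on GC logging on the executors and see whether the pauses line up with collections.

import org.apache.spark.SparkConf

// Sketch only: standard HotSpot GC-logging flags passed through Spark's
// spark.executor.extraJavaOptions; adjust the flags to your JVM.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")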

Thanks
Best Regards

On Sat, Jan 31, 2015 at 7:22 AM, Corey Nolet  wrote:

> We have a series of spark jobs which run in succession over various cached
> datasets, do small groups and transforms, and then call
> saveAsSequenceFile() on them.
>
> Each call to save as a sequence file appears to have done its work, the
> task says it completed in "xxx.x seconds" but then it pauses and the
> pauses are quite significant- sometimes up to 2 minutes. We are trying to
> figure out what's going on during this pause- if the executors are really
> still writing to the sequence files or if maybe a race condition is
> happening on an executor which is causing timeouts.
>
> Any ideas? Anyone else seen this happening?
>
>
> We also tried running all the saveAsSequenceFile calls in separate futures
> to see if maybe the waiting would still only take 1-2 minutes but it looks
> like the waiting still takes the sum of the amount  of time it would have
> originally (several several minutes). Our job runs, in its entirety, 35
> minutes and we're estimating that we're spending at least 16 minutes in
> this paused state. What I haven't been able to do is figure out how to
> trace through all the executors. Is there a way to do this? The event logs
> in yarn don't seem to help much with this.
>


Re: measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Akhil Das
I believe you will get these measurements from the web UI (running on port 8080).
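
If you need the numbers programmatically rather than from the UI, a hedged sketch (not from the original mail) is to attach a SparkListener and log per-stage wall-clock time, which roughly maps back to the map/reduceByKey/filter/flatMap boundaries:

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Sketch only: assumes `sc` is the SparkContext backing the streaming job.
sc.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val info = stage.stageInfo
    for (start <- info.submissionTime; end <- info.completionTime) {
      println(s"Stage ${info.stageId} (${info.name}) took ${end - start} ms")
    }
  }
})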

Thanks
Best Regards

On Sat, Jan 31, 2015 at 9:29 AM, Josh J  wrote:

> Hi,
>
> I have a stream pipeline which invokes map, reduceByKey, filter, and
> flatMap. How can I measure the time taken in each stage?
>
> Thanks,
> Josh
>


Re: Build error

2015-01-30 Thread Tathagata Das
That is a known issue uncovered last week. It fails in certain
environments, but not on Jenkins, which is our testing environment.
There is already a PR up to fix it. For now you can build using "mvn
package -DskipTests".
TD

On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman <
andrew.mussel...@gmail.com> wrote:

> Off master, got this error; is that typical?
>
> ---
>  T E S T S
> ---
> Running org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.495 sec
> - in org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite
>
> Results :
>
>
>
>
> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (test) @
> spark-streaming-mqtt_2.10 ---
> Discovery starting.
> Discovery completed in 498 milliseconds.
> Run starting. Expected test count is: 1
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
>   org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in
> progress
>   at
> org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:432)
>   at
> org.eclipse.paho.client.mqttv3.internal.ClientComms.internalSend(ClientComms.java:121)
>   at
> org.eclipse.paho.client.mqttv3.internal.ClientComms.sendNoWait(ClientComms.java:139)
>   at org.eclipse.paho.client.mqttv3.MqttTopic.publish(MqttTopic.java:107)
>   at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$publishData$1.apply(MQTTStreamSuite.scala:125)
>   at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$publishData$1.apply(MQTTStreamSuite.scala:124)
>   at scala.collection.immutable.Range.foreach(Range.scala:141)
>   at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite.publishData(MQTTStreamSuite.scala:124)
>   at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$3.apply$mcV$sp(MQTTStreamSuite.scala:78)
>   at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$3.apply(MQTTStreamSuite.scala:66)
>   ...
> Exception in thread "Thread-20" org.apache.spark.SparkException: Job
> cancelled because SparkContext was shut down
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:690)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:689)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:689)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1384)
> at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81)
> at
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1319)
> at org.apache.spark.SparkContext.stop(SparkContext.scala:1250)
> at
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:510)
> at
> org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:485)
> at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply$mcV$sp(MQTTStreamSuite.scala:59)
> at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply(MQTTStreamSuite.scala:57)
> at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply(MQTTStreamSuite.scala:57)
> at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:210)
> at
> org.apache.spark.streaming.mqtt.MQTTStreamSuite.runTest(MQTTStreamSuite.scala:38)
> at
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> at
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> at
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> at
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
> at org.scalatest.SuperEngine.org
> $scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
> at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
> at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
> at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
> at org.scalatest.Suite$class.run(Suite.scala:1424)
> at org.scalatest.FunSuite.org
> $scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
> at
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> at
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
> at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
> at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
> at org.apache.spark.streaming.mqtt.MQTTStreamSuite.org
> $scalatest$BeforeAndAfter$$super$run(MQTTStreamSuite.scala:38)
> at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-01-30 Thread Milad khajavi
Here are the same issues:
[1] 
http://stackoverflow.com/questions/28186607/java-lang-classcastexception-using-lambda-expressions-in-spark-job-on-remote-ser
[2] 
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJUHuJoE7nP6MMOJJKTL6kZtamQ=qhym1aozmezbnetla1y...@mail.gmail.com%3E#archives

Could you please explain exactly what you are trying to do, and show the
code you are working on?

On Thu, Jan 22, 2015 at 12:29 PM, thanhtien522  wrote:
> Update: I deployed a standalone Spark on localhost, set the master to
> spark://localhost:7077, and hit the same issue.
> I don't know how to solve it.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-YARN-java-lang-ClassCastException-SerializedLambda-to-org-apache-spark-api-java-function-Fu1-tp21261p21315.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 
Milād Khājavi
http://blog.khajavi.ir
Having the source means you can do it yourself.
I tried to change the world, but I couldn’t find the source code.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Build error

2015-01-30 Thread Andrew Musselman
Off master, got this error; is that typical?

---
 T E S T S
---
Running org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.495 sec -
in org.apache.spark.streaming.mqtt.JavaMQTTStreamSuite

Results :




Tests run: 1, Failures: 0, Errors: 0, Skipped: 0

[INFO]
[INFO] --- scalatest-maven-plugin:1.0:test (test) @
spark-streaming-mqtt_2.10 ---
Discovery starting.
Discovery completed in 498 milliseconds.
Run starting. Expected test count is: 1
MQTTStreamSuite:
- mqtt input stream *** FAILED ***
  org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in
progress
  at
org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:432)
  at
org.eclipse.paho.client.mqttv3.internal.ClientComms.internalSend(ClientComms.java:121)
  at
org.eclipse.paho.client.mqttv3.internal.ClientComms.sendNoWait(ClientComms.java:139)
  at org.eclipse.paho.client.mqttv3.MqttTopic.publish(MqttTopic.java:107)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$publishData$1.apply(MQTTStreamSuite.scala:125)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$publishData$1.apply(MQTTStreamSuite.scala:124)
  at scala.collection.immutable.Range.foreach(Range.scala:141)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite.publishData(MQTTStreamSuite.scala:124)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$3.apply$mcV$sp(MQTTStreamSuite.scala:78)
  at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$3.apply(MQTTStreamSuite.scala:66)
  ...
Exception in thread "Thread-20" org.apache.spark.SparkException: Job
cancelled because SparkContext was shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:690)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:689)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:689)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1384)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1319)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1250)
at
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:510)
at
org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:485)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply$mcV$sp(MQTTStreamSuite.scala:59)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply(MQTTStreamSuite.scala:57)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite$$anonfun$2.apply(MQTTStreamSuite.scala:57)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:210)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite.runTest(MQTTStreamSuite.scala:38)
at
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
at
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
at
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.org
$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
at org.scalatest.Suite$class.run(Suite.scala:1424)
at org.scalatest.FunSuite.org
$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555)
at
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212)
at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212)
at org.apache.spark.streaming.mqtt.MQTTStreamSuite.org
$scalatest$BeforeAndAfter$$super$run(MQTTStreamSuite.scala:38)
at org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241)
at
org.apache.spark.streaming.mqtt.MQTTStreamSuite.run(MQTTStreamSuite.scala:38)
at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492)
at
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528)
at
org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1526)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.scalatest.Suit

measuring time taken in map, reduceByKey, filter, flatMap

2015-01-30 Thread Josh J
Hi,

I have a stream pipeline which invokes map, reduceByKey, filter, and
flatMap. How can I measure the time taken in each stage?

Thanks,
Josh


Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Cheng Lian
Yeah, currently there isn't such a repo. However, the Spark team is 
working on this.


Cheng

On 1/30/15 8:19 AM, Ayoub wrote:

I am not personally aware of a repo for snapshot builds.
In my use case, I had to build spark 1.2.1-snapshot

see https://spark.apache.org/docs/latest/building-spark.html

2015-01-30 17:11 GMT+01:00 Debajyoti Roy <[hidden email]>:


Thanks Ayoub and Zhan,
I am new to spark and wanted to make sure i am not trying
something stupid or using a wrong API.

Is there a repo where i can pull the snapshot or nighly builds for
spark ?



On Fri, Jan 30, 2015 at 2:45 AM, Ayoub Benali <[hidden email]> wrote:

Hello,

I had the same issue, and then I found this JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-4825
So I switched to Spark 1.2.1-snapshot, which solved the problem.



2015-01-30 8:40 GMT+01:00 Zhan Zhang <[hidden email]>:

I think it is expected. Refer to the comment on
saveAsTable: "Note that this currently only works with
SchemaRDDs that are created from a HiveContext." If I
understand correctly, the SchemaRDD here means one
generated by HiveContext.sql, not by applySchema.
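
A hedged sketch of that reading (not from the original mail), assuming Spark 1.2; the file and table names are illustrative:

import org.apache.spark.sql.hive.HiveContext

// Sketch only: assumes `sc` is an existing SparkContext.
val hiveCtx = new HiveContext(sc)
hiveCtx.jsonFile("events.json").registerTempTable("events_tmp")

// A SchemaRDD produced by HiveContext.sql can be saved as a Hive table...
hiveCtx.sql("SELECT * FROM events_tmp").saveAsTable("events_persisted")
// ...whereas calling saveAsTable on a SchemaRDD built with applySchema is what fails above.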

Thanks.

Zhan Zhang



On Jan 29, 2015, at 9:38 PM, matroyd <[hidden email]> wrote:


Hi, I am trying saveAsTable on SchemaRDD created from
HiveContext and it fails. This is on Spark 1.2.0.
Following are details of the code, command and
exceptions:

http://stackoverflow.com/questions/28222496/how-to-enable-sql-on-schemardd-via-the-jdbc-interface-is-it-even-possible


Thanks in advance for any guidance


View this message in context: HiveContext created
SchemaRDD's saveAsTable is not working on 1.2.0

Sent from the Apache Spark User List mailing list archive at Nabble.com.






-- 
Thanks,


*Debajyoti Roy*
[hidden email] 
(646)561-0844
350 Madison Ave., FL 16,
New York, NY 10017.




View this message in context: Re: HiveContext created SchemaRDD's
saveAsTable is not working on 1.2.0
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Cheng Lian
According to the Gist Ayoub provided, the schema is fine. I reproduced 
this issue locally; it should be a bug, but I don't think it's related to 
SPARK-5236. I will investigate this soon.


Ayoub - would you mind filing a JIRA for this issue? Thanks!

Cheng

On 1/30/15 11:28 AM, Michael Armbrust wrote:
Is it possible that your schema contains duplicate columns or columns 
with spaces in the names? The Parquet library will often give 
confusing error messages in this case.


On Fri, Jan 30, 2015 at 10:33 AM, Ayoub wrote:


Hello,

I have a problem when querying, with a Hive context on Spark
1.2.1-snapshot, a column in my table which is a nested data
structure, like an array of structs.
The problem happens only on the table stored as Parquet; querying
the SchemaRDD saved as a temporary table doesn't lead to
any exception.

My steps are:
1) reading a JSON file
2) creating a SchemaRDD and saving it as a tmp table
3) creating an external table in the Hive metastore stored as a Parquet file
4) inserting the data from the tmp table into the persisted table
5) querying the persisted table, which leads to this exception:

"select data.field1 from persisted_table LATERAL VIEW
explode(data_array) nestedStuff AS data"

parquet.io.ParquetDecodingException: Can not read value at 0 in
block -1 in file hdfs://***/test_table/part-1
at

parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at

parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at

org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
at

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at

org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at
parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:99)
at parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:94)
at

parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:274)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
at
parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
at
parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
at

parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
   


Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of spark jobs which run in succession over various cached
datasets, do small groups and transforms, and then call
saveAsSequenceFile() on them.

Each call to save as a sequence file appears to have done its work, the
task says it completed in "xxx.x seconds" but then it pauses and the
pauses are quite significant- sometimes up to 2 minutes. We are trying to
figure out what's going on during this pause- if the executors are really
still writing to the sequence files or if maybe a race condition is
happening on an executor which is causing timeouts.

Any ideas? Anyone else seen this happening?


We also tried running all the saveAsSequenceFile calls in separate futures
to see if maybe the waiting would still only take 1-2 minutes but it looks
like the waiting still takes the sum of the amount  of time it would have
originally (several several minutes). Our job runs, in its entirety, 35
minutes and we're estimating that we're spending at least 16 minutes in
this paused state. What I haven't been able to do is figure out how to
trace through all the executors. Is there a way to do this? The event logs
in yarn don't seem to help much with this.


Re: Spark SQL - Unable to use Hive UDF because of ClassNotFoundException

2015-01-30 Thread Marcelo Vanzin
Hi Capitão,

Since you're using CDH, your question is probably more appropriate for
the cdh-u...@cloudera.org list.

The problem you're seeing is most probably an artifact of the way CDH
is currently packaged. You have to add the Hive jars manually to your Spark
app's classpath if you want to use the Hive support.

For example, if you're using CDH rpm/deb packages, you'd add all jars
in /usr/lib/hive/lib/ to `spark.driver.extraClassPath` and
`spark.executor.extraClassPath`, with the added caveat of excluding
the "guava" jar from that list.


On Fri, Jan 30, 2015 at 9:17 AM, Capitão  wrote:
> I've been trying to run HiveQL queries with UDFs in Spark SQL, but with no
> success. The problem occurs only when using functions, like the
> from_unixtime (represented by the Hive class UDFFromUnixTime).
>
> I'm using Spark 1.2 with CDH5.3.0. Running the queries in local mode works,
> but in YARN mode it doesn't. I'm creating an uber-jar with all the needed
> dependencies, excluding the ones provided by the cluster (Spark, Hadoop) and
> including the Hive ones. When I run the queries in Yarn I get the following
> exception:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due
> to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
> Lost task 1.3 in stage 0.0 (TID 20, ):
> java.lang.NoClassDefFoundError: Lorg/apache/hadoop/hive/ql/exec/UDF;
> at java.lang.Class.getDeclaredFields0(Native Method)
> at java.lang.Class.privateGetDeclaredFields(Class.java:2499)
> at java.lang.Class.getDeclaredField(Class.java:1951)
> at
> java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
> at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
> at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
> at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
> at
> java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
> at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
> at
> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInp

Error Compiling

2015-01-30 Thread Eduardo Costa Alfaia
Hi Guys,

Any idea how to solve this error?

[error] 
/sata_disk/workspace/spark-1.1.1/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala:76:
 missing parameter type for expanded function ((x$6, x$7) => x$6.$plus(x$7))

[error] val wordCounts = words.map(x => (x, 1L)).reduceByWindow(_ + _, _ - 
_, Minutes(1), Seconds(2), 2)

Thanks
-- 
Privacy Notice: http://www.unibs.it/node/8155


Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
Yeah, from an unscientific test, it looks like the time to cache the
blocks still dominates. Saving the count is probably a win, but not
big. Well, maybe good to know.
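
For reference, a minimal sketch of the two variants being compared, assuming `rdd` has already been marked with .cache():

rdd.count()                    // materializes the cache, but also computes and returns a count
rdd.foreachPartition(_ => ())  // materializes the cache with a no-op per partition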

On Fri, Jan 30, 2015 at 10:47 PM, Stephen Boesch  wrote:
> Theoretically your approach would require less overhead - i.e. a collect on
> the driver is not required as the last step.  But maybe the difference is
> small and that particular path may or may not have been properly optimized
> vs the count(). Do you have a biggish data set to compare the timings?
>
> 2015-01-30 14:42 GMT-08:00 Sean Owen :
>>
>> So far, the canonical way to materialize an RDD just to make sure it's
>> cached is to call count(). That's fine but incurs the overhead of
>> actually counting the elements.
>>
>> However, rdd.foreachPartition(p => None) for example also seems to
>> cause the RDD to be materialized, and is a no-op. Is that a better way
>> to do it or am I not thinking of why it's insufficient?
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Trouble deploying spark program because of soft link?

2015-01-30 Thread suhshekar52
-_- Sorry for the spam...I thought I could run spark apps on workers, but I
cloned it on my spark master and now it works. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-deploying-spark-program-because-of-soft-link-tp21450p21451.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Yifan LI
Yes, I think so, especially for a Pregel application… do you have any suggestions?

Best,
Yifan LI





> On 30 Jan 2015, at 22:25, Sonal Goyal  wrote:
> 
> Is your code hitting frequent garbage collection? 
> 
> Best Regards,
> Sonal
> Founder, Nube Technologies  
> 
>  
> 
> 
> 
> On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI  > wrote:
> 
>> 
>> 
>> Hi,
>> 
>> I am running my GraphX application on Spark 1.2.0 (an 11-node cluster), and have 
>> requested 30GB of memory per node and 100 cores for an input dataset of around 1GB 
>> (a 5-million-vertex graph).
>> 
>> But the error below always happens…
>> 
>> Is there anyone who could give me some pointers? 
>> 
>> (BTW, the overall edge/vertex RDDs will reach more than 100GB during graph 
>> computation, and another version of my application works well on the same 
>> dataset while needing much less memory during computation.)
>> 
>> Thanks in advance!!!
>> 
>> 
>> 15/01/29 18:05:08 ERROR ContextCleaner: Error cleaning broadcast 60
>> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>>  at 
>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>>  at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>>  at 
>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>>  at scala.concurrent.Await$.result(package.scala:107)
>>  at 
>> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
>>  at 
>> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
>>  at 
>> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
>>  at 
>> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
>>  at 
>> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
>>  at 
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
>>  at 
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
>>  at scala.Option.foreach(Option.scala:236)
>>  at 
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
>>  at 
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
>>  at 
>> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
>>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
>>  at org.apache.spark.ContextCleaner.org 
>> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
>>  at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
>> [Stage 91:===>  (2 + 4) 
>> / 6]15/01/29 18:08:15 ERROR SparkDeploySchedulerBackend: Asked to remove 
>> non-existent executor 0
>> [Stage 93:>  (29 + 20) / 
>> 49]15/01/29 23:47:03 ERROR TaskSchedulerImpl: Lost executor 9 on 
>> small11-tap1.common.lip6.fr : remote 
>> Akka client disassociated
>> [Stage 83:>   (1 + 0) / 6][Stage 86:>   (0 + 1) / 2][Stage 88:>   (0 + 2) / 
>> 8]15/01/29 23:47:06 ERROR SparkDeploySchedulerBackend: Asked to remove 
>> non-existent executor 9
>> [Stage 83:===>  (5 + 1) / 6][Stage 88:=>   (9 + 2) / 
>> 11]15/01/29 23:57:30 ERROR TaskSchedulerImpl: Lost executor 8 on 
>> small10-tap1.common.lip6.fr : remote 
>> Akka client disassociated
>> 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove 
>> non-existent executor 8
>> 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove 
>> non-existent executor 8
>> 
>> Best,
>> Yifan LI
>> 
>> 
>> 
>> 
>> 
> 
> 



Re: Cheapest way to materialize an RDD?

2015-01-30 Thread Stephen Boesch
Theoretically your approach would require less overhead - i.e. a collect on
the driver is not required as the last step.  But maybe the difference is
small and that particular path may or may not have been properly optimized
vs the count(). Do you have a biggish data set to compare the timings?

2015-01-30 14:42 GMT-08:00 Sean Owen :

> So far, the canonical way to materialize an RDD just to make sure it's
> cached is to call count(). That's fine but incurs the overhead of
> actually counting the elements.
>
> However, rdd.foreachPartition(p => None) for example also seems to
> cause the RDD to be materialized, and is a no-op. Is that a better way
> to do it or am I not thinking of why it's insufficient?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Cheapest way to materialize an RDD?

2015-01-30 Thread Sean Owen
So far, the canonical way to materialize an RDD just to make sure it's
cached is to call count(). That's fine but incurs the overhead of
actually counting the elements.

However, rdd.foreachPartition(p => None) for example also seems to
cause the RDD to be materialized, and is a no-op. Is that a better way
to do it or am I not thinking of why it's insufficient?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Trouble deploying spark program because of soft link?

2015-01-30 Thread suhshekar52
Sorry if this is a double post...I'm not sure if I can send from email or
have to come to the user list to create a new topic.

A bit confused on this one...I have set up the KafkaWordCount found here:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

Everything runs fine when I run it on instance A using this repository:
https://github.com/aijibd/KafkaSpark

The word count works fine, etc. 

I literally clone that repository and try to run it on instance B and get
the following error:

I've gotten NoClassDefFoundErrors while building the pom file, but I don't
think missing dependencies is the issue here as it says
"/usr/lib/hadoop/bin/hadoop: No such file or directory" which is true,
hadoop is under usr/bin/ not usr/lib. I'm confused why it is looking for a
file there...I don't recall changing any settings regarding this. Thank you!

line 83: /usr/lib/hadoop/bin/hadoop: No such file or directory
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/FSDataInputStream
at
org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:307)
at
org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:220)
at
org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:75)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 5 more



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Trouble-deploying-spark-program-because-of-soft-link-tp21450.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why is DecimalType separate from DataType ?

2015-01-30 Thread Michael Armbrust
You are grabbing the singleton, not the class. You need to specify the
precision (e.g. DecimalType.Unlimited or DecimalType(precision, scale)).
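
For example, a minimal sketch assuming Spark 1.2's SQL types (field names are illustrative):

import org.apache.spark.sql._   // in 1.2 this exposes StructType, StructField, DecimalType

val schema = StructType(Seq(
  StructField("credit_amount", DecimalType.Unlimited, true),  // unlimited precision
  StructField("interest_rate", DecimalType(10, 2), true)      // precision 10, scale 2
))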

On Fri, Jan 30, 2015 at 2:23 PM, Manoj Samel 
wrote:

> Spark 1.2
>
> While building schemaRDD using StructType
>
> xxx = new StructField("credit_amount", DecimalType, true) gives error
> "type mismatch; found :
> org.apache.spark.sql.catalyst.types.DecimalType.type required:
> org.apache.spark.sql.catalyst.types.DataType"
>
> From
> https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.package,
> it seems DecimalType = sql.catalyst.types.DecimalType is separate from
> DataType = sql.catalyst.types.DataType
>
> Not sure why that is the case? How does one uses Decimal and other types
> in StructField?
>
> Thanks,
>
>
>
>


Why is DecimalType separate from DataType ?

2015-01-30 Thread Manoj Samel
Spark 1.2

While building schemaRDD using StructType

xxx = new StructField("credit_amount", DecimalType, true) gives error "type
mismatch; found : org.apache.spark.sql.catalyst.types.DecimalType.type
required: org.apache.spark.sql.catalyst.types.DataType"

From
https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.package,
it seems DecimalType = sql.catalyst.types.DecimalType is separate from
DataType = sql.catalyst.types.DataType

Not sure why that is the case? How does one uses Decimal and other types in
StructField?

Thanks,


Trouble deploying spark program because of soft link?

2015-01-30 Thread Su She
Hello Everyone,

A bit confused on this one...I have set up the KafkaWordCount found here:
https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java

Everything runs fine when I run it on instance A using this repository:
https://github.com/aijibd/KafkaSpark

The word count works fine, etc.

I literally clone that repository and try to run it on instance B and get
the following error:

I've gotten NoClassDefFoundErrors while building the pom file, but I don't
think missing dependencies is the issue here as it says
"/usr/lib/hadoop/bin/hadoop: No such file or directory" which is true,
hadoop is under usr/bin/ not usr/lib. I'm confused why it is looking for a
file there...I don't recall changing any settings regarding this. Thank you!

line 83: /usr/lib/hadoop/bin/hadoop: No such file or directory
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/fs/FSDataInputStream
at
org.apache.spark.deploy.SparkSubmitArguments.parse$1(SparkSubmitArguments.scala:307)
at
org.apache.spark.deploy.SparkSubmitArguments.parseOpts(SparkSubmitArguments.scala:220)
at
org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:75)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:70)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 5 more


Re: [Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Sonal Goyal
Is your code hitting frequent garbage collection?

Best Regards,
Sonal
Founder, Nube Technologies 





On Fri, Jan 30, 2015 at 7:52 PM, Yifan LI  wrote:

>
>
>
> Hi,
>
> I am running my GraphX application on Spark 1.2.0 (an 11-node cluster), and have
> requested 30GB of memory per node and 100 cores for an input dataset of around 1GB
> (a 5-million-vertex graph).
>
> But the error below always happens…
>
> Is there anyone who could give me some pointers?
>
> (BTW, the overall edge/vertex RDDs will reach more than 100GB during graph
> computation, and another version of my application works well on the
> same dataset while needing much less memory during computation.)
>
> Thanks in advance!!!
>
>
> 15/01/29 18:05:08 ERROR ContextCleaner: Error cleaning broadcast 60
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at
> org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
> at
> org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
> at
> org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
> at
> org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
> at
> org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
> at
> org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
> at org.apache.spark.ContextCleaner.org
> 
> $apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
> at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
> [Stage 91:===>  (2 +
> 4) / 6]15/01/29 18:08:15 ERROR SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 0
> [Stage 93:>  (29 + 20)
> / 49]15/01/29 23:47:03 ERROR TaskSchedulerImpl: Lost executor 9 on
> small11-tap1.common.lip6.fr: remote Akka client disassociated
> [Stage 83:>   (1 + 0) / 6][Stage 86:>   (0 + 1) / 2][Stage 88:>   (0 + 2)
> / 8]15/01/29 23:47:06 ERROR SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 9
> [Stage 83:===>  (5 + 1) / 6][Stage 88:=>   (9 + 2)
> / 11]15/01/29 23:57:30 ERROR TaskSchedulerImpl: Lost executor 8 on
> small10-tap1.common.lip6.fr: remote Akka client disassociated
> 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 8
> 15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove
> non-existent executor 8
>
> Best,
> Yifan LI
>
>
>
>
>
>
>


Re: Serialized task result size exceeded

2015-01-30 Thread Charles Feduke
Are you using the default Java object serialization, or have you tried Kryo
yet? If you haven't tried Kryo please do, and let me know how much it
impacts the serialization size. (I know it's more efficient; I'm curious to
know how much more efficient, and I'm being lazy - I don't have ~6K 500MB
files on hand.)

You can call saveAsObjectFile on, say, a take(1) from an RDD and examine the
serialized output to see if a much larger object graph than you expect is
being output.
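
A hedged sketch (not from the original mail) of the configuration side, assuming Spark 1.2; the registered class is illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.driver.maxResultSize", "2g")   // only if the results legitimately need more room
  // optionally register the classes that dominate the task results:
  // .registerKryoClasses(Array(classOf[MyRecord]))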

On Fri Jan 30 2015 at 3:47:31 PM ankits  wrote:

> This is on spark 1.2
>
> I am loading ~6k parquet files, roughly 500 MB each into a schemaRDD, and
> calling count() on it.
>
> After loading about 2705 tasks (there is one per file), the job crashes
> with
> this error:
> Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)
>
> This indicates that the results of each task are about 1024/2705 ≈ 0.38 MB
> each. Is that normal? I don't know exactly what the result of each task
> would be, but 0.38 MB for each seems too high. Can anyone offer an
> explanation as to what the normal size should be if this is too high, or
> ways to reduce this?
>
> Thanks.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Serialized-task-result-size-exceeded-tp21449.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: groupByKey is not working

2015-01-30 Thread Amit Behera
Thank you very much Charles, I got it  :)



On Sat, Jan 31, 2015 at 2:20 AM, Charles Feduke 
wrote:

> You'll still need to:
>
> import org.apache.spark.SparkContext._
>
> Importing org.apache.spark._ does _not_ recurse into sub-objects or
> sub-packages, it only brings in whatever is at the level of the package or
> object imported.
>
> SparkContext._ has some implicits, one of them for adding groupByKey to an
> RDD[_] IIRC.
>
>
> On Fri Jan 30 2015 at 3:48:22 PM Stephen Boesch  wrote:
>
>> Amit - IJ will not find it until you add the import as Sean mentioned.
>> It includes implicits that intellij will not know about otherwise.
>>
>> 2015-01-30 12:44 GMT-08:00 Amit Behera :
>>
>> I am sorry Sean.
>>>
>>> I am developing code in intelliJ Idea. so with the above dependencies I
>>> am not able to find *groupByKey* when I am searching by ctrl+
>>>
>>>
>>> On Sat, Jan 31, 2015 at 2:04 AM, Sean Owen  wrote:
>>>
 When you post a question anywhere, and say "it's not working", you
 *really* need to say what that means.


 On Fri, Jan 30, 2015 at 8:20 PM, Amit Behera 
 wrote:
 > hi all,
 >
 > my sbt file is like this:
 >
 > name := "Spark"
 >
 > version := "1.0"
 >
 > scalaVersion := "2.10.4"
 >
 > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
 >
 > libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
 >
 >
 > code:
 >
 > object SparkJob
 > {
 >
 >   def pLines(lines:Iterator[String])={
 > val parser=new CSVParser()
 > lines.map(l=>{val vs=parser.parseLine(l)
 >   (vs(0),vs(1).toInt)})
 >   }
 >
 >   def main(args: Array[String]) {
 > val conf = new SparkConf().setAppName("Spark
 Job").setMaster("local")
 > val sc = new SparkContext(conf)
 > val data = sc.textFile("/home/amit/testData.csv").cache()
 > val result = data.mapPartitions(pLines).groupByKey
 > //val list = result.filter(x=> {(x._1).contains("24050881")})
 >
 >   }
 >
 > }
 >
 >
 > Here groupByKey is not working . But same thing is working from
 spark-shell.
 >
 > Please help me
 >
 >
 > Thanks
 >
 > Amit

>>>
>>>


Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
You'll still need to:

import org.apache.spark.SparkContext._

Importing org.apache.spark._ does _not_ recurse into sub-objects or
sub-packages, it only brings in whatever is at the level of the package or
object imported.

SparkContext._ has some implicits, one of them for adding groupByKey to an
RDD[_] IIRC.
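
For completeness, a minimal sketch of the import block that makes groupByKey resolve on Spark 1.1/1.2:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // rddToPairRDDFunctions adds groupByKey to RDD[(K, V)]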

On Fri Jan 30 2015 at 3:48:22 PM Stephen Boesch  wrote:

> Amit - IJ will not find it until you add the import as Sean mentioned.  It
> includes implicits that intellij will not know about otherwise.
>
> 2015-01-30 12:44 GMT-08:00 Amit Behera :
>
> I am sorry Sean.
>>
>> I am developing code in intelliJ Idea. so with the above dependencies I
>> am not able to find *groupByKey* when I am searching by ctrl+
>>
>>
>> On Sat, Jan 31, 2015 at 2:04 AM, Sean Owen  wrote:
>>
>>> When you post a question anywhere, and say "it's not working", you
>>> *really* need to say what that means.
>>>
>>>
>>> On Fri, Jan 30, 2015 at 8:20 PM, Amit Behera 
>>> wrote:
>>> > hi all,
>>> >
>>> > my sbt file is like this:
>>> >
>>> > name := "Spark"
>>> >
>>> > version := "1.0"
>>> >
>>> > scalaVersion := "2.10.4"
>>> >
>>> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
>>> >
>>> > libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
>>> >
>>> >
>>> > code:
>>> >
>>> > object SparkJob
>>> > {
>>> >
>>> >   def pLines(lines:Iterator[String])={
>>> > val parser=new CSVParser()
>>> > lines.map(l=>{val vs=parser.parseLine(l)
>>> >   (vs(0),vs(1).toInt)})
>>> >   }
>>> >
>>> >   def main(args: Array[String]) {
>>> > val conf = new SparkConf().setAppName("Spark
>>> Job").setMaster("local")
>>> > val sc = new SparkContext(conf)
>>> > val data = sc.textFile("/home/amit/testData.csv").cache()
>>> > val result = data.mapPartitions(pLines).groupByKey
>>> > //val list = result.filter(x=> {(x._1).contains("24050881")})
>>> >
>>> >   }
>>> >
>>> > }
>>> >
>>> >
>>> > Here groupByKey is not working . But same thing is working from
>>> spark-shell.
>>> >
>>> > Please help me
>>> >
>>> >
>>> > Thanks
>>> >
>>> > Amit
>>>
>>
>>


Re: spark challenge: zip with next???

2015-01-30 Thread Derrick Burns
Koert, thanks for the referral to your current pull request!  I found it
very thoughtful and thought-provoking.



On Fri, Jan 30, 2015 at 9:19 AM, Koert Kuipers  wrote:

> and if its a single giant timeseries that is already sorted then Mohit's
> solution sounds good to me.
>
> On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak 
> wrote:
>
>> But isn't foldLeft() overkill for the originally stated use case of max
>> diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative
>> non-associative accumulation as opposed to an embarrassingly parallel
>> operation such as this one?
>>
>> This use case reminds me of FIR filtering in DSP. It seems that RDDs
>> could use something that serves the same purpose as
>> scala.collection.Iterator.sliding.
>>
>>   --
>>  *From:* Koert Kuipers 
>> *To:* Mohit Jaggi 
>> *Cc:* Tobias Pfeiffer ; "Ganelin, Ilya" <
>> ilya.gane...@capitalone.com>; derrickburns ; "
>> user@spark.apache.org" 
>> *Sent:* Friday, January 30, 2015 7:11 AM
>> *Subject:* Re: spark challenge: zip with next???
>>
>> assuming the data can be partitioned then you have many timeseries for
>> which you want to detect potential gaps. also assuming the resulting gaps
>> info per timeseries is much smaller data than the timeseries data itself,
>> then this is a classical example to me of a sorted (streaming) foldLeft,
>> requiring an efficient secondary sort in the spark shuffle. i am trying to
>> get that into spark here:
>> https://issues.apache.org/jira/browse/SPARK-3655
>>
>>
>>
>> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi 
>> wrote:
>>
>>
>> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>>
>> you can use the MLLib function or do the following (which is what I had
>> done):
>>
>> - in the first pass over the data, using mapPartitionsWithIndex, gather the
>> first item in each partition. You can use collect (or an aggregator) for this.
>> “Key” them by the partition index. At the end, you will have a map
>> (partition index) --> first item
>> - in the second pass over the data, using mapPartitionsWithIndex again,
>> look at two items at a time (or in the general case N items at a time; you can
>> use Scala’s sliding iterator) and check the time difference (or any other
>> sliding-window computation). To this mapPartitions call, pass the map created in
>> the previous step. You will need it to check the last item in each
>> partition.
>>
>> If you can tolerate a few inaccuracies then you can just do the second
>> step. You will miss the “boundaries” of the partitions but it might be
>> acceptable for your use case.
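
[Editor's note: a minimal Scala sketch of the two-pass approach described above. Names are hypothetical; `ts` is assumed to be a single, already time-sorted RDD[Long] of timestamps, and empty partitions are ignored for simplicity.]

import org.apache.spark.rdd.RDD

// Pass 1: collect the first timestamp of every partition, keyed by partition index.
def firstPerPartition(ts: RDD[Long]): Map[Int, Long] =
  ts.mapPartitionsWithIndex { (idx, it) =>
    if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
  }.collect().toMap

// Pass 2: slide over each partition, appending the next partition's first element
// so that gaps across partition boundaries are not missed.
def gaps(ts: RDD[Long], maxGap: Long): RDD[(Long, Long)] = {
  val heads = ts.sparkContext.broadcast(firstPerPartition(ts))
  ts.mapPartitionsWithIndex { (idx, it) =>
    val extended = it ++ heads.value.get(idx + 1).iterator
    extended.sliding(2).collect { case Seq(a, b) if b - a > maxGap => (a, b) }
  }
}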
>>
>>
>>
>> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer  wrote:
>>
>> Hi,
>>
>> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <
>> ilya.gane...@capitalone.com> wrote:
>>
>>  Make a copy of your RDD with an extra entry in the beginning to offset.
>> The you can zip the two RDDs and run a map to generate an RDD of
>> differences.
>>
>>
>> Does that work? I recently tried something to compute differences between
>> each entry and the next, so I did
>>   val rdd1 = ... // null element + rdd
>>   val rdd2 = ... // rdd + null element
>> but got an error message about zip requiring data sizes in each partition
>> to match.
>>
>> Tobias
>>
>>
>>
>>
>>
>>
>


Re: groupByKey is not working

2015-01-30 Thread Amit Behera
Hi Charles,

I forgot to mention. But I imported the following

import au.com.bytecode.opencsv.CSVParser

import org.apache.spark._

On Sat, Jan 31, 2015 at 2:09 AM, Charles Feduke 
wrote:

> Define "not working". Not compiling? If so you need:
>
> import org.apache.spark.SparkContext._
>
>
> On Fri Jan 30 2015 at 3:21:45 PM Amit Behera  wrote:
>
>> hi all,
>>
>> my sbt file is like this:
>>
>> name := "Spark"
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.4"
>>
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
>>
>> libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
>>
>>
>> *code:*
>>
>> object SparkJob
>> {
>>
>>   def pLines(lines:Iterator[String])={
>> val parser=new CSVParser()
>> lines.map(l=>{val vs=parser.parseLine(l)
>>   (vs(0),vs(1).toInt)})
>>   }
>>
>>   def main(args: Array[String]) {
>> val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
>> val sc = new SparkContext(conf)
>> val data = sc.textFile("/home/amit/testData.csv").cache()
>> val result = data.mapPartitions(pLines).groupByKey
>> //val list = result.filter(x=> {(x._1).contains("24050881")})
>>
>>   }
>>
>> }
>>
>>
>> Here groupByKey is not working . But same thing is working from 
>> *spark-shell.*
>>
>> Please help me
>>
>>
>> Thanks
>>
>> Amit
>>
>>


Re: groupByKey is not working

2015-01-30 Thread Stephen Boesch
Amit - IJ will not find it until you add the import as Sean mentioned.  It
includes implicits that intellij will not know about otherwise.

2015-01-30 12:44 GMT-08:00 Amit Behera :

> I am sorry Sean.
>
> I am developing code in intelliJ Idea. so with the above dependencies I am
> not able to find *groupByKey* when I am searching by ctrl+
>
>
> On Sat, Jan 31, 2015 at 2:04 AM, Sean Owen  wrote:
>
>> When you post a question anywhere, and say "it's not working", you
>> *really* need to say what that means.
>>
>>
>> On Fri, Jan 30, 2015 at 8:20 PM, Amit Behera 
>> wrote:
>> > hi all,
>> >
>> > my sbt file is like this:
>> >
>> > name := "Spark"
>> >
>> > version := "1.0"
>> >
>> > scalaVersion := "2.10.4"
>> >
>> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
>> >
>> > libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
>> >
>> >
>> > code:
>> >
>> > object SparkJob
>> > {
>> >
>> >   def pLines(lines:Iterator[String])={
>> > val parser=new CSVParser()
>> > lines.map(l=>{val vs=parser.parseLine(l)
>> >   (vs(0),vs(1).toInt)})
>> >   }
>> >
>> >   def main(args: Array[String]) {
>> > val conf = new SparkConf().setAppName("Spark
>> Job").setMaster("local")
>> > val sc = new SparkContext(conf)
>> > val data = sc.textFile("/home/amit/testData.csv").cache()
>> > val result = data.mapPartitions(pLines).groupByKey
>> > //val list = result.filter(x=> {(x._1).contains("24050881")})
>> >
>> >   }
>> >
>> > }
>> >
>> >
>> > Here groupByKey is not working . But same thing is working from
>> spark-shell.
>> >
>> > Please help me
>> >
>> >
>> > Thanks
>> >
>> > Amit
>>
>
>


Serialized task result size exceeded

2015-01-30 Thread ankits
This is on spark 1.2

I am loading ~6k parquet files, roughly 500 MB each into a schemaRDD, and
calling count() on it.

After loading about 2705 tasks (there is one per file), the job crashes with
this error:
Total size of serialized results of 2705 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)

This indicates that the results of each task are about 1024 MB / 2705 tasks ≈ 0.38 MB
each. Is that normal? I don't know exactly what the result of each task
would be, but roughly 0.38 MB for each seems too high. Can anyone offer an
explanation as to what the normal size should be if this is too high, or
ways to reduce this?
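
[Editor's note: one possible knob, not mentioned in this message and only a sketch; it raises the driver-side limit named in the error rather than shrinking the per-task results.]

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: raise the cap on accumulated task results (the 1.2 default is 1g).
val conf = new SparkConf()
  .setAppName("parquet-count")              // hypothetical app name
  .set("spark.driver.maxResultSize", "2g")
val sc = new SparkContext(conf)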

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Serialized-task-result-size-exceeded-tp21449.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: groupByKey is not working

2015-01-30 Thread Amit Behera
I am sorry Sean.

I am developing the code in IntelliJ IDEA, so with the above dependencies I am
not able to find *groupByKey* when I am searching by ctrl+


On Sat, Jan 31, 2015 at 2:04 AM, Sean Owen  wrote:

> When you post a question anywhere, and say "it's not working", you
> *really* need to say what that means.
>
> On Fri, Jan 30, 2015 at 8:20 PM, Amit Behera  wrote:
> > hi all,
> >
> > my sbt file is like this:
> >
> > name := "Spark"
> >
> > version := "1.0"
> >
> > scalaVersion := "2.10.4"
> >
> > libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
> >
> > libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
> >
> >
> > code:
> >
> > object SparkJob
> > {
> >
> >   def pLines(lines:Iterator[String])={
> > val parser=new CSVParser()
> > lines.map(l=>{val vs=parser.parseLine(l)
> >   (vs(0),vs(1).toInt)})
> >   }
> >
> >   def main(args: Array[String]) {
> > val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
> > val sc = new SparkContext(conf)
> > val data = sc.textFile("/home/amit/testData.csv").cache()
> > val result = data.mapPartitions(pLines).groupByKey
> > //val list = result.filter(x=> {(x._1).contains("24050881")})
> >
> >   }
> >
> > }
> >
> >
> > Here groupByKey is not working . But same thing is working from
> spark-shell.
> >
> > Please help me
> >
> >
> > Thanks
> >
> > Amit
>


Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Filed https://issues.apache.org/jira/browse/SPARK-5500 for this.

-Sandy

On Fri, Jan 30, 2015 at 11:59 AM, Aaron Davidson  wrote:

> Ah, this is in particular an issue due to sort-based shuffle (it was not
> the case for hash-based shuffle, which would immediately serialize each
> record rather than holding many in memory at once). The documentation
> should be updated.
>
> On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza 
> wrote:
>
>> Hi Andrew,
>>
>> Here's a note from the doc for sequenceFile:
>>
>> * '''Note:''' Because Hadoop's RecordReader class re-uses the same
>> Writable object for each
>> * record, directly caching the returned RDD will create many
>> references to the same object.
>> * If you plan to directly cache Hadoop writable objects, you should
>> first copy them using
>> * a `map` function.
>>
>> This should probably say "direct cachingly *or directly shuffling*".  To
>> sort directly from a sequence file, the records need to be cloned first.
>>
>> -Sandy
>>
>>
>> On Fri, Jan 30, 2015 at 11:20 AM, andrew.rowson <
>> andrew.row...@thomsonreuters.com> wrote:
>>
>>> I've found a strange issue when trying to sort a lot of data in HDFS
>>> using
>>> spark 1.2.0 (CDH5.3.0). My data is in sequencefiles and the key is a
>>> class
>>> that derives from BytesWritable (the value is also a BytesWritable). I'm
>>> using a custom KryoSerializer to serialize the underlying byte array
>>> (basically write the length and the byte array).
>>>
>>> My spark job looks like this:
>>>
>>> spark.sequenceFile(inputPath, classOf[CustomKey],
>>> classOf[BytesWritable]).sortByKey().map(t =>
>>> t._1).saveAsTextFile(outputPath)
>>>
>>> CustomKey extends BytesWritable, adds a toString method and some other
>>> helper methods that extract and convert parts of the underlying byte[].
>>>
>>> This should simply output a series of textfiles which contain the sorted
>>> list of keys. The problem is that under certain circumstances I get many
>>> duplicate keys. The number of records output is correct, but it appears
>>> that
>>> large chunks of the output are simply copies of the last record in that
>>> chunk. E.g instead of [1,2,3,4,5,6,7,8,9] I'll see [9,9,9,9,9,9,9,9,9].
>>>
>>> This appears to happen only above certain input data volumes, and it
>>> appears
>>> to be when shuffle spills. For a job where shuffle spill for memory and
>>> disk
>>> = 0B, the data is correct. If there is any spill, I see the duplicate
>>> behaviour. Oddly, the shuffle write is much smaller when there's a spill.
>>> E.g. the non spill job has 18.8 GB of input and 14.9GB of shuffle write,
>>> whereas the spill job has 24.2 GB of input, and only 4.9GB of shuffle
>>> write.
>>> I'm guessing some sort of compression is happening on duplicate identical
>>> values?
>>>
>>> Oddly, I can fix this issue if I adjust my scala code to insert a map
>>> step
>>> before the call to sortByKey():
>>>
>>> .map(t => (new CustomKey(t._1),t._2))
>>>
>>> This constructor is just:
>>>
>>> public CustomKey(CustomKey left) { this.set(left); }
>>>
>>> Why does this work? I've no idea.
>>>
>>> The spark job is running in yarn-client mode with all the default
>>> configuration values set. Using the external shuffle service and
>>> disabling
>>> spill compression makes no difference.
>>>
>>> Is this a bug?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-key-when-sorting-BytesWritable-with-Kryo-tp21447.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: groupByKey is not working

2015-01-30 Thread Charles Feduke
Define "not working". Not compiling? If so you need:

import org.apache.spark.SparkContext._
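
[Editor's note: a minimal sketch of why that import matters. In Spark 1.1 the implicit conversion to PairRDDFunctions (which provides groupByKey) lives in the SparkContext companion object; the spark-shell imports it for you.]

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // brings rddToPairRDDFunctions into scope

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Example").setMaster("local"))
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val grouped = pairs.groupByKey()       // compiles only because of the import above
    grouped.collect().foreach(println)
    sc.stop()
  }
}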


On Fri Jan 30 2015 at 3:21:45 PM Amit Behera  wrote:

> hi all,
>
> my sbt file is like this:
>
> name := "Spark"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
>
> libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
>
>
> *code:*
>
> object SparkJob
> {
>
>   def pLines(lines:Iterator[String])={
> val parser=new CSVParser()
> lines.map(l=>{val vs=parser.parseLine(l)
>   (vs(0),vs(1).toInt)})
>   }
>
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
> val sc = new SparkContext(conf)
> val data = sc.textFile("/home/amit/testData.csv").cache()
> val result = data.mapPartitions(pLines).groupByKey
> //val list = result.filter(x=> {(x._1).contains("24050881")})
>
>   }
>
> }
>
>
> Here groupByKey is not working . But same thing is working from *spark-shell.*
>
> Please help me
>
>
> Thanks
>
> Amit
>
>


Re: groupByKey is not working

2015-01-30 Thread Arush Kharbanda
Hi Amit,

What error does it throw?

Thanks
Arush

On Sat, Jan 31, 2015 at 1:50 AM, Amit Behera  wrote:

> hi all,
>
> my sbt file is like this:
>
> name := "Spark"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
>
> libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
>
>
> *code:*
>
> object SparkJob
> {
>
>   def pLines(lines:Iterator[String])={
> val parser=new CSVParser()
> lines.map(l=>{val vs=parser.parseLine(l)
>   (vs(0),vs(1).toInt)})
>   }
>
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
> val sc = new SparkContext(conf)
> val data = sc.textFile("/home/amit/testData.csv").cache()
> val result = data.mapPartitions(pLines).groupByKey
> //val list = result.filter(x=> {(x._1).contains("24050881")})
>
>   }
>
> }
>
>
> Here groupByKey is not working . But same thing is working from *spark-shell.*
>
> Please help me
>
>
> Thanks
>
> Amit
>
>


-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


groupByKey is not working

2015-01-30 Thread Amit Behera
hi all,

my sbt file is like this:

name := "Spark"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"


*code:*

object SparkJob
{

  def pLines(lines: Iterator[String]) = {
    val parser = new CSVParser()
    lines.map(l => { val vs = parser.parseLine(l)
      (vs(0), vs(1).toInt) })
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
    val sc = new SparkContext(conf)
    val data = sc.textFile("/home/amit/testData.csv").cache()
    val result = data.mapPartitions(pLines).groupByKey
    //val list = result.filter(x => {(x._1).contains("24050881")})
  }

}


Here groupByKey is not working, but the same thing is working from *spark-shell.*

Please help me


Thanks

Amit


Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Aaron Davidson
Ah, this is in particular an issue due to sort-based shuffle (it was not
the case for hash-based shuffle, which would immediately serialize each
record rather than holding many in memory at once). The documentation
should be updated.

On Fri, Jan 30, 2015 at 11:27 AM, Sandy Ryza 
wrote:

> Hi Andrew,
>
> Here's a note from the doc for sequenceFile:
>
> * '''Note:''' Because Hadoop's RecordReader class re-uses the same
> Writable object for each
> * record, directly caching the returned RDD will create many
> references to the same object.
> * If you plan to directly cache Hadoop writable objects, you should
> first copy them using
> * a `map` function.
>
> This should probably say "direct cachingly *or directly shuffling*".  To
> sort directly from a sequence file, the records need to be cloned first.
>
> -Sandy
>
>
> On Fri, Jan 30, 2015 at 11:20 AM, andrew.rowson <
> andrew.row...@thomsonreuters.com> wrote:
>
>> I've found a strange issue when trying to sort a lot of data in HDFS using
>> spark 1.2.0 (CDH5.3.0). My data is in sequencefiles and the key is a class
>> that derives from BytesWritable (the value is also a BytesWritable). I'm
>> using a custom KryoSerializer to serialize the underlying byte array
>> (basically write the length and the byte array).
>>
>> My spark job looks like this:
>>
>> spark.sequenceFile(inputPath, classOf[CustomKey],
>> classOf[BytesWritable]).sortByKey().map(t =>
>> t._1).saveAsTextFile(outputPath)
>>
>> CustomKey extends BytesWritable, adds a toString method and some other
>> helper methods that extract and convert parts of the underlying byte[].
>>
>> This should simply output a series of textfiles which contain the sorted
>> list of keys. The problem is that under certain circumstances I get many
>> duplicate keys. The number of records output is correct, but it appears
>> that
>> large chunks of the output are simply copies of the last record in that
>> chunk. E.g instead of [1,2,3,4,5,6,7,8,9] I'll see [9,9,9,9,9,9,9,9,9].
>>
>> This appears to happen only above certain input data volumes, and it
>> appears
>> to be when shuffle spills. For a job where shuffle spill for memory and
>> disk
>> = 0B, the data is correct. If there is any spill, I see the duplicate
>> behaviour. Oddly, the shuffle write is much smaller when there's a spill.
>> E.g. the non spill job has 18.8 GB of input and 14.9GB of shuffle write,
>> whereas the spill job has 24.2 GB of input, and only 4.9GB of shuffle
>> write.
>> I'm guessing some sort of compression is happening on duplicate identical
>> values?
>>
>> Oddly, I can fix this issue if I adjust my scala code to insert a map step
>> before the call to sortByKey():
>>
>> .map(t => (new CustomKey(t._1),t._2))
>>
>> This constructor is just:
>>
>> public CustomKey(CustomKey left) { this.set(left); }
>>
>> Why does this work? I've no idea.
>>
>> The spark job is running in yarn-client mode with all the default
>> configuration values set. Using the external shuffle service and disabling
>> spill compression makes no difference.
>>
>> Is this a bug?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-key-when-sorting-BytesWritable-with-Kryo-tp21447.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Are you using SGD for logistic regression? There's a random element
there too, by nature. I looked into the code and see that you can't
set a seed, but actually, the sampling is done with a fixed seed per
partition anyway. Hm.

In general you would not expect these algorithms to produce the same
result, given the stochastic nature. In this particular case, I'm not
sure if you can or should be able to get the implementation to act
deterministically. Even if the overt use of randomness is seed-able,
there may be some non-determinism in the distributed nature of the
processing that is having an effect.

On Fri, Jan 30, 2015 at 7:27 PM, Jianguo Li  wrote:
> Thanks. I did specify a seed parameter.
>
> Seems that the problem is not caused by kFold. I actually ran another
> experiment without cross validation. I just built a model with the training
> data and then tested the model on the test data. However, the accuracy still
> varies from one run to another. Interestingly, this only happens when I ran
> the experiment on our cluster. If I ran the experiment on my local machine,
> I can reproduce the result each time. Has anybody encountered similar issue
> before?
>
> Thanks,
>
> Jianguo
>
> On Fri, Jan 30, 2015 at 11:22 AM, Sean Owen  wrote:
>>
>> Have a look at the source code for MLUtils.kFold. Yes, there is a
>> random element. That's good; you want the folds to be randomly chosen.
>> Note there is a seed parameter, as in a lot of the APIs, that lets you
>> fix the RNG seed and so get the same result every time, if you need
>> to.
>>
>> On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li 
>> wrote:
>> > Hi,
>> >
>> > I am using the utility function kFold provided in Spark for doing k-fold
>> > cross validation using logistic regression. However, each time I run the
>> > experiment, I got different different result. Since everything else
>> > stays
>> > constant, I was wondering if this is due to the kFold function I used.
>> > Does
>> > anyone know if the kFold gives you a different split on a data set each
>> > time
>> > you call it?
>> >
>> > Thanks,
>> >
>> > Jianguo
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



pyspark and and page allocation failures due to memory fragmentation

2015-01-30 Thread Antony Mayi
Hi,
When running a big map-reduce operation with pyspark (in this particular case using
a lot of sets and set operations in the map tasks, so likely allocating and freeing
loads of pages) I eventually get the kernel error 'python: page allocation failure:
order:10, mode:0x2000d0' plus a very verbose dump, which I can reduce to the
following snippet:

Node 1 Normal: 3601*4kB (UEM) 3159*8kB (UEM) 1669*16kB (UEM) 763*32kB (UEM)
1451*64kB (UEM) 15*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 185836kB
...SLAB: Unable to allocate memory on node 1 (gfp=0xd0)
cache: size-4194304, object size: 4194304, order: 10

So the memory simply got fragmented and there are no higher-order pages left.
The interesting thing is that there is no error thrown by Spark itself - the
processing just gets stuck without any error (only the kernel dmesg explains
what happened in the background).

Any kernel experts out there with advice on how to avoid this? I have tried a
few vm options but still no joy.

Running Spark 1.2.0 (CDH 5.3.0) on kernel 3.8.13.

Thanks, Antony.

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Ayoub
No it is not the case, here is the gist to reproduce the issue
https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936
On Jan 30, 2015 8:29 PM, "Michael Armbrust"  wrote:

> Is it possible that your schema contains duplicate columns or column with
> spaces in the name?  The parquet library will often give confusing error
> messages in this case.
>
> On Fri, Jan 30, 2015 at 10:33 AM, Ayoub 
> wrote:
>
>> Hello,
>>
>> I have a problem when querying, with a hive context on spark
>> 1.2.1-snapshot, a column in my table which is nested data structure like an
>> array of struct.
>> The problems happens only on the table stored as parquet, while querying
>> the Schema RDD saved, as a temporary table, don't lead to any exception.
>>
>> my steps are:
>> 1) reading JSON file
>> 2) creating a schema RDD and saving it as a tmp table
>> 3) creating an external table in hive meta store saved as parquet file
>> 4) inserting the data from the tmp table to the persisted table
>> 5) queering the persisted table lead to this exception:
>>
>> "select data.field1 from persisted_table LATERAL VIEW explode(data_array)
>> nestedStuff AS data"
>>
>> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1
>> in file hdfs://***/test_table/part-1
>> at
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
>> at
>> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
>> at
>> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
>> at
>> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> at scala.collection.TraversableOnce$class.to
>> (TraversableOnce.scala:273)
>> at 
>> scala.collection.AbstractIterator.to(Iterator.scala:1157)
>> at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>> at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>> at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>> at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>> at java.util.ArrayList.get(ArrayList.java:411)
>> at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
>> at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
>> at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
>> at parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:99)
>> at parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:94)
>> at
>> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:274)
>> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
>> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
>> at
>> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
>> at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
>> at
>> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
>> at
>> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
>> ... 28 more
>>
>> Driver stacktrace:
>> at 
>> org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAG

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Michael Armbrust
Is it possible that your schema contains duplicate columns or a column with
spaces in the name?  The Parquet library will often give confusing error
messages in this case.
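
[Editor's note: one quick way to check, as a sketch; `hiveContext` is a hypothetical HiveContext instance and the table name is taken from the report quoted below.]

// Print the columns Hive/Parquet actually sees for the table.
hiveContext.sql("DESCRIBE persisted_table").collect().foreach(println)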

On Fri, Jan 30, 2015 at 10:33 AM, Ayoub  wrote:

> Hello,
>
> I have a problem when querying, with a hive context on spark
> 1.2.1-snapshot, a column in my table which is nested data structure like an
> array of struct.
> The problems happens only on the table stored as parquet, while querying
> the Schema RDD saved, as a temporary table, don't lead to any exception.
>
> my steps are:
> 1) reading JSON file
> 2) creating a schema RDD and saving it as a tmp table
> 3) creating an external table in hive meta store saved as parquet file
> 4) inserting the data from the tmp table to the persisted table
> 5) queering the persisted table lead to this exception:
>
> "select data.field1 from persisted_table LATERAL VIEW explode(data_array)
> nestedStuff AS data"
>
> parquet.io.ParquetDecodingException: Can not read value at 0 in block -1
> in file hdfs://***/test_table/part-1
> at
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
> at
> parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
> at
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
> at
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at 
> scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
> at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
> at java.util.ArrayList.rangeCheck(ArrayList.java:635)
> at java.util.ArrayList.get(ArrayList.java:411)
> at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
> at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
> at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
> at parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:99)
> at parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:94)
> at
> parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:274)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
> at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
> at
> parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
> at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
> at
> parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
> at
> parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
> ... 28 more
>
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at
> scala.collection.mutable.ResizableA

Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Thanks. I did specify a seed parameter.

Seems that the problem is not caused by kFold. I actually ran another
experiment without cross validation. I just built a model with the training
data and then tested the model on the test data. However, the accuracy
still varies from one run to another. Interestingly, this only happens when
I ran the experiment on our cluster. If I ran the experiment on my local
machine, I can reproduce the result each time. Has anybody encountered
similar issue before?

Thanks,

Jianguo

On Fri, Jan 30, 2015 at 11:22 AM, Sean Owen  wrote:

> Have a look at the source code for MLUtils.kFold. Yes, there is a
> random element. That's good; you want the folds to be randomly chosen.
> Note there is a seed parameter, as in a lot of the APIs, that lets you
> fix the RNG seed and so get the same result every time, if you need
> to.
>
> On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li 
> wrote:
> > Hi,
> >
> > I am using the utility function kFold provided in Spark for doing k-fold
> > cross validation using logistic regression. However, each time I run the
> > experiment, I got different different result. Since everything else stays
> > constant, I was wondering if this is due to the kFold function I used.
> Does
> > anyone know if the kFold gives you a different split on a data set each
> time
> > you call it?
> >
> > Thanks,
> >
> > Jianguo
>


Re: Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread Sandy Ryza
Hi Andrew,

Here's a note from the doc for sequenceFile:

* '''Note:''' Because Hadoop's RecordReader class re-uses the same
Writable object for each
* record, directly caching the returned RDD will create many references
to the same object.
* If you plan to directly cache Hadoop writable objects, you should
first copy them using
* a `map` function.

This should probably say "directly caching *or directly shuffling*".  To
sort directly from a sequence file, the records need to be cloned first.
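
[Editor's note: a hedged sketch of that cloning, reusing the variable names from the original job. The CustomKey copy constructor is the one shown in the quoted post; BytesWritable.copyBytes() is assumed to be available (Hadoop 2.x), otherwise copy the backing array manually.]

import org.apache.hadoop.io.BytesWritable
import org.apache.spark.SparkContext._

// Clone each record before it enters the shuffle so the Writable instances
// re-used by the RecordReader are not aliased inside the sort buffers.
spark.sequenceFile(inputPath, classOf[CustomKey], classOf[BytesWritable])
  .map { case (k, v) => (new CustomKey(k), new BytesWritable(v.copyBytes())) } // deep copies
  .sortByKey()
  .map(_._1)
  .saveAsTextFile(outputPath)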

-Sandy


On Fri, Jan 30, 2015 at 11:20 AM, andrew.rowson <
andrew.row...@thomsonreuters.com> wrote:

> I've found a strange issue when trying to sort a lot of data in HDFS using
> spark 1.2.0 (CDH5.3.0). My data is in sequencefiles and the key is a class
> that derives from BytesWritable (the value is also a BytesWritable). I'm
> using a custom KryoSerializer to serialize the underlying byte array
> (basically write the length and the byte array).
>
> My spark job looks like this:
>
> spark.sequenceFile(inputPath, classOf[CustomKey],
> classOf[BytesWritable]).sortByKey().map(t =>
> t._1).saveAsTextFile(outputPath)
>
> CustomKey extends BytesWritable, adds a toString method and some other
> helper methods that extract and convert parts of the underlying byte[].
>
> This should simply output a series of textfiles which contain the sorted
> list of keys. The problem is that under certain circumstances I get many
> duplicate keys. The number of records output is correct, but it appears
> that
> large chunks of the output are simply copies of the last record in that
> chunk. E.g instead of [1,2,3,4,5,6,7,8,9] I'll see [9,9,9,9,9,9,9,9,9].
>
> This appears to happen only above certain input data volumes, and it
> appears
> to be when shuffle spills. For a job where shuffle spill for memory and
> disk
> = 0B, the data is correct. If there is any spill, I see the duplicate
> behaviour. Oddly, the shuffle write is much smaller when there's a spill.
> E.g. the non spill job has 18.8 GB of input and 14.9GB of shuffle write,
> whereas the spill job has 24.2 GB of input, and only 4.9GB of shuffle
> write.
> I'm guessing some sort of compression is happening on duplicate identical
> values?
>
> Oddly, I can fix this issue if I adjust my scala code to insert a map step
> before the call to sortByKey():
>
> .map(t => (new CustomKey(t._1),t._2))
>
> This constructor is just:
>
> public CustomKey(CustomKey left) { this.set(left); }
>
> Why does this work? I've no idea.
>
> The spark job is running in yarn-client mode with all the default
> configuration values set. Using the external shuffle service and disabling
> spill compression makes no difference.
>
> Is this a bug?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-key-when-sorting-BytesWritable-with-Kryo-tp21447.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Define size partitions

2015-01-30 Thread Davies Liu
I think the new API sc.binaryRecords [1] (added in 1.2) can help in this case.

[1] 
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

Davies
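
[Editor's note: the same API exists on the Scala SparkContext. A minimal sketch, assuming the fixed layout described in the quoted question (8-byte number, 16-byte char field, 4-byte number, 1-byte char); the path is hypothetical.]

import java.nio.ByteBuffer
import org.apache.spark.{SparkConf, SparkContext}

val recordLength = 8 + 16 + 4 + 1   // bytes per block, from the header description

val sc = new SparkContext(new SparkConf().setAppName("fixed-records"))
val records = sc.binaryRecords("hdfs:///data/blocks.bin", recordLength)  // RDD[Array[Byte]]

val parsed = records.map { bytes =>
  val buf  = ByteBuffer.wrap(bytes)
  val id   = buf.getLong()                        // Number(8 bytes)
  val name = new Array[Byte](16); buf.get(name)   // Char(16 bytes)
  val num  = buf.getInt()                         // Number(4 bytes)
  val flag = buf.get()                            // Char(1 byte)
  (id, new String(name, "US-ASCII").trim, num, flag)
}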

On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz  wrote:
> Hi,
>
> I want to process some files, there're a king of big, dozens of
> gigabytes each one. I get them like a array of bytes and there's an
> structure inside of them.
>
> I have a header which describes the structure. It could be like:
> Number(8bytes) Char(16bytes) Number(4 bytes) Char(1bytes), ..
> This structure appears N times on the file.
>
> So, I could know the size of each block since it's fix. There's not
> separator among block and block.
>
> If I would do this with MapReduce, I could implement a new
> RecordReader and InputFormat  to read each block because I know the
> size of them and I'd fix the split size in the driver. (blockX1000 for
> example). On this way, I could know that each split for each mapper
> has complete blocks and there isn't a piece of the last block in the
> next split.
>
> Spark works with RDD and partitions, How could I resize  each
> partition to do that?? is it possible? I guess that Spark doesn't use
> the RecordReader and these classes for these tasks.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Duplicate key when sorting BytesWritable with Kryo?

2015-01-30 Thread andrew.rowson
I've found a strange issue when trying to sort a lot of data in HDFS using
spark 1.2.0 (CDH5.3.0). My data is in sequencefiles and the key is a class
that derives from BytesWritable (the value is also a BytesWritable). I'm
using a custom KryoSerializer to serialize the underlying byte array
(basically write the length and the byte array).

My spark job looks like this:

spark.sequenceFile(inputPath, classOf[CustomKey],
classOf[BytesWritable]).sortByKey().map(t =>
t._1).saveAsTextFile(outputPath)

CustomKey extends BytesWritable, adds a toString method and some other
helper methods that extract and convert parts of the underlying byte[].

This should simply output a series of textfiles which contain the sorted
list of keys. The problem is that under certain circumstances I get many
duplicate keys. The number of records output is correct, but it appears that
large chunks of the output are simply copies of the last record in that
chunk. E.g instead of [1,2,3,4,5,6,7,8,9] I'll see [9,9,9,9,9,9,9,9,9]. 

This appears to happen only above certain input data volumes, and it appears
to be when shuffle spills. For a job where shuffle spill for memory and disk
= 0B, the data is correct. If there is any spill, I see the duplicate
behaviour. Oddly, the shuffle write is much smaller when there's a spill.
E.g. the non spill job has 18.8 GB of input and 14.9GB of shuffle write,
whereas the spill job has 24.2 GB of input, and only 4.9GB of shuffle write.
I'm guessing some sort of compression is happening on duplicate identical
values?

Oddly, I can fix this issue if I adjust my scala code to insert a map step
before the call to sortByKey():

.map(t => (new CustomKey(t._1),t._2))

This constructor is just:

public CustomKey(CustomKey left) { this.set(left); }

Why does this work? I've no idea.

The spark job is running in yarn-client mode with all the default
configuration values set. Using the external shuffle service and disabling
spill compression makes no difference.

Is this a bug?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Duplicate-key-when-sorting-BytesWritable-with-Kryo-tp21447.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



KafkaWordCount

2015-01-30 Thread Eduardo Costa Alfaia
Hi Guys,

I would like to pass the Kafka parameter val kafkaParams =
Map(“fetch.message.max.bytes” -> “400”) into the KafkaWordCount Scala code.
I’ve used the variable like this:

val KafkaDStreams = (1 to numStreams) map { _ =>
  KafkaUtils.createStream(ssc, kafkaParams, zkQuorum, group, topicpMap).map(_._2)
}


However, I’ve gotten these errors:

  (jssc: org.apache.spark.streaming.api.java.JavaStreamingContext,zkQuorum: String,groupId: String,topics: java.util.Map[String,Integer],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream[String,String]
[error]   (ssc: org.apache.spark.streaming.StreamingContext,zkQuorum: String,groupId: String,topics: scala.collection.immutable.Map[String,Int],storageLevel: org.apache.spark.storage.StorageLevel)org.apache.spark.streaming.dstream.ReceiverInputDStream[(String, String)]
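
[Editor's note: the compiler is listing the overloads that take zkQuorum/groupId directly, which do not accept a kafkaParams map. A hedged sketch of the overload that does; it needs explicit type parameters, decoders and a StorageLevel. StringDecoder for key and value and the "4000000" value are assumptions.]

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect"       -> zkQuorum,
  "group.id"                -> group,
  "fetch.message.max.bytes" -> "4000000")   // hypothetical value

val KafkaDStreams = (1 to numStreams) map { _ =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topicpMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
}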

Thanks
-- 
Informativa sulla Privacy: http://www.unibs.it/node/8155


Re: Define size partitions

2015-01-30 Thread Rishi Yadav
If you are only concerned about large partition sizes, you can specify the number
of partitions as an additional parameter while loading files from HDFS.
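
[Editor's note: a one-line sketch; the second argument to textFile is the minimum number of partitions.]

// Ask for at least 500 partitions instead of the HDFS-block-derived default.
val data = sc.textFile("hdfs:///big/input", minPartitions = 500)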

On Fri, Jan 30, 2015 at 9:47 AM, Sven Krasser  wrote:

> You can also use your InputFormat/RecordReader in Spark, e.g. using
> newAPIHadoopFile. See here:
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
> .
> -Sven
>
> On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz 
> wrote:
>
>> Hi,
>>
>> I want to process some files, there're a king of big, dozens of
>> gigabytes each one. I get them like a array of bytes and there's an
>> structure inside of them.
>>
>> I have a header which describes the structure. It could be like:
>> Number(8bytes) Char(16bytes) Number(4 bytes) Char(1bytes), ..
>> This structure appears N times on the file.
>>
>> So, I could know the size of each block since it's fix. There's not
>> separator among block and block.
>>
>> If I would do this with MapReduce, I could implement a new
>> RecordReader and InputFormat  to read each block because I know the
>> size of them and I'd fix the split size in the driver. (blockX1000 for
>> example). On this way, I could know that each split for each mapper
>> has complete blocks and there isn't a piece of the last block in the
>> next split.
>>
>> Spark works with RDD and partitions, How could I resize  each
>> partition to do that?? is it possible? I guess that Spark doesn't use
>> the RecordReader and these classes for these tasks.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
>


[hive context] Unable to query array once saved as parquet

2015-01-30 Thread Ayoub
Hello,

I have a problem when querying, with a Hive context on Spark
1.2.1-snapshot, a column in my table which is a nested data structure like an
array of structs.
The problem happens only on the table stored as Parquet; querying the
schema RDD saved as a temporary table doesn't lead to any exception.

My steps are:
1) read the JSON file
2) create a schema RDD and save it as a tmp table
3) create an external table in the Hive metastore stored as a Parquet file
4) insert the data from the tmp table into the persisted table
5) query the persisted table, which leads to this exception:

"select data.field1 from persisted_table LATERAL VIEW explode(data_array)
nestedStuff AS data"

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in
file hdfs://***/test_table/part-1
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at 
scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$17.apply(RDD.scala:797)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
at parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:99)
at parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:94)
at
parquet.io.RecordReaderImplementation.(RecordReaderImplementation.java:274)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
at
parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
at
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
at
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 28 more

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.sc

Spark streaming - tracking/deleting processed files

2015-01-30 Thread ganterm
We are running a Spark streaming job that retrieves files from a directory
(using textFileStream). 
One concern we are having is the case where the job is down but files are
still being added to the directory.
Once the job starts up again, those files are not being picked up (since
they are not new or changed while the job is running) but we would like them
to be processed. 
Is there a solution for that? Is there a way to keep track of which files have
been processed, and can we "force" older files to be picked up? Is there a
way to delete the processed files?

Thanks!
Markus 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tracking-deleting-processed-files-tp21444.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
Right. Which makes me believe that the directory is perhaps configured
somewhere and I have missed configuring it. The process that is
submitting jobs (which basically becomes the driver) is running in sudo mode and the
executors are executed by YARN. The Hadoop username is configured as
'hadoop' (the default user in EMR).

On Fri, Jan 30, 2015, 11:25 PM Sven Krasser  wrote:

> From your stacktrace it appears that the S3 writer tries to write the data
> to a temp file on the local file system first. Taking a guess, that local
> directory doesn't exist or you don't have permissions for it.
> -Sven
>
> On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> I am programmatically submit spark jobs in yarn-client mode on EMR.
>> Whenever a job tries to save file to s3, it gives the below mentioned
>> exception. I think the issue might be what EMR is not setup properly as I
>> have to set all hadoop configurations manually in SparkContext. However, I
>> am not sure which configuration am I missing (if any).
>>
>> Configurations that I am using in SparkContext to setup EMRFS:
>> "spark.hadoop.fs.s3n.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
>> "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
>> "spark.hadoop.fs.emr.configuration.version": "1.0",
>> "spark.hadoop.fs.s3n.multipart.uploads.enabled": "true",
>> "spark.hadoop.fs.s3.enableServerSideEncryption": "false",
>> "spark.hadoop.fs.s3.serverSideEncryptionAlgorithm": "AES256",
>> "spark.hadoop.fs.s3.consistent": "true",
>> "spark.hadoop.fs.s3.consistent.retryPolicyType": "exponential",
>> "spark.hadoop.fs.s3.consistent.retryPeriodSeconds": "10",
>> "spark.hadoop.fs.s3.consistent.retryCount": "5",
>> "spark.hadoop.fs.s3.maxRetries": "4",
>> "spark.hadoop.fs.s3.sleepTimeSeconds": "10",
>> "spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency": "true",
>> "spark.hadoop.fs.s3.consistent.metadata.autoCreate": "true",
>> "spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
>> "spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500",
>> "spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100",
>> "spark.hadoop.fs.s3.consistent.fastList": "true",
>> "spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata": "false",
>> "spark.hadoop.fs.s3.consistent.notification.CloudWatch": "false",
>> "spark.hadoop.fs.s3.consistent.notification.SQS": "false",
>>
>> Exception:
>> java.io.IOException: No such file or directory
>> at java.io.UnixFileSystem.createFileExclusively(Native Method)
>> at java.io.File.createNewFile(File.java:1006)
>> at java.io.File.createTempFile(File.java:1989)
>> at
>> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
>> at
>> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
>> at
>> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
>> at
>> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
>> at
>> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>> at
>> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>> at
>> org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
>> at
>> org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
>> at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Hints? Suggestions?
>>
>
>
>
> --
> http://sites.google.com/site/krasser/?utm_source=sig
>


Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Sven Krasser
From your stacktrace it appears that the S3 writer tries to write the data
to a temp file on the local file system first. Taking a guess, that local
directory doesn't exist or you don't have permissions for it.
-Sven

On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> I am programmatically submit spark jobs in yarn-client mode on EMR.
> Whenever a job tries to save file to s3, it gives the below mentioned
> exception. I think the issue might be what EMR is not setup properly as I
> have to set all hadoop configurations manually in SparkContext. However, I
> am not sure which configuration am I missing (if any).
>
> Configurations that I am using in SparkContext to setup EMRFS:
> "spark.hadoop.fs.s3n.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
> "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
> "spark.hadoop.fs.emr.configuration.version": "1.0",
> "spark.hadoop.fs.s3n.multipart.uploads.enabled": "true",
> "spark.hadoop.fs.s3.enableServerSideEncryption": "false",
> "spark.hadoop.fs.s3.serverSideEncryptionAlgorithm": "AES256",
> "spark.hadoop.fs.s3.consistent": "true",
> "spark.hadoop.fs.s3.consistent.retryPolicyType": "exponential",
> "spark.hadoop.fs.s3.consistent.retryPeriodSeconds": "10",
> "spark.hadoop.fs.s3.consistent.retryCount": "5",
> "spark.hadoop.fs.s3.maxRetries": "4",
> "spark.hadoop.fs.s3.sleepTimeSeconds": "10",
> "spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency": "true",
> "spark.hadoop.fs.s3.consistent.metadata.autoCreate": "true",
> "spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
> "spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500",
> "spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100",
> "spark.hadoop.fs.s3.consistent.fastList": "true",
> "spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata": "false",
> "spark.hadoop.fs.s3.consistent.notification.CloudWatch": "false",
> "spark.hadoop.fs.s3.consistent.notification.SQS": "false",
>
> Exception:
> java.io.IOException: No such file or directory
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1006)
> at java.io.File.createTempFile(File.java:1989)
> at
> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
> at
> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
> at
> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
> at
> com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
> at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
> at
> org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
> at
> org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
> at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Hints? Suggestions?
>



-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: Define size partitions

2015-01-30 Thread Sven Krasser
You can also use your InputFormat/RecordReader in Spark, e.g. using
newAPIHadoopFile. See here:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
.
-Sven
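
[Editor's note: a sketch of what that could look like. FixedBlockInputFormat and its key/value types are hypothetical placeholders for the custom InputFormat/RecordReader described in the quoted question, not an existing class.]

import org.apache.hadoop.io.{BytesWritable, LongWritable}

// FixedBlockInputFormat is assumed to be a custom
// org.apache.hadoop.mapreduce.InputFormat[LongWritable, BytesWritable]
// whose RecordReader emits exactly one fixed-size block per record.
val blocks = sc.newAPIHadoopFile[LongWritable, BytesWritable, FixedBlockInputFormat](
  "hdfs:///data/blocks.bin")  // hypothetical path

val payloads = blocks.map { case (_, v) => v.copyBytes() }  // one Array[Byte] per block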

On Fri, Jan 30, 2015 at 6:50 AM, Guillermo Ortiz 
wrote:

> Hi,
>
> I want to process some files, there're a king of big, dozens of
> gigabytes each one. I get them like a array of bytes and there's an
> structure inside of them.
>
> I have a header which describes the structure. It could be like:
> Number(8bytes) Char(16bytes) Number(4 bytes) Char(1bytes), ..
> This structure appears N times on the file.
>
> So, I could know the size of each block since it's fix. There's not
> separator among block and block.
>
> If I would do this with MapReduce, I could implement a new
> RecordReader and InputFormat  to read each block because I know the
> size of them and I'd fix the split size in the driver. (blockX1000 for
> example). On this way, I could know that each split for each mapper
> has complete blocks and there isn't a piece of the last block in the
> next split.
>
> Spark works with RDD and partitions, How could I resize  each
> partition to do that?? is it possible? I guess that Spark doesn't use
> the RecordReader and these classes for these tasks.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: Does the kFold in Spark always give you the same split?

2015-01-30 Thread Sean Owen
Have a look at the source code for MLUtils.kFold. Yes, there is a
random element. That's good; you want the folds to be randomly chosen.
Note there is a seed parameter, as in a lot of the APIs, that lets you
fix the RNG seed and so get the same result every time, if you need
to.
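
For example, a minimal sketch with a fixed seed (data is assumed to be an existing
RDD, e.g. of LabeledPoint):

import org.apache.spark.mllib.util.MLUtils

val folds = MLUtils.kFold(data, 10, 42)   // 10 folds, seed fixed at 42
folds.foreach { case (training, validation) =>
  // the split is identical on every run with the same seed
  println(s"train=${training.count()} validation=${validation.count()}")
}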

On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li  wrote:
> Hi,
>
> I am using the utility function kFold provided in Spark for doing k-fold
> cross validation using logistic regression. However, each time I run the
> experiment, I got different different result. Since everything else stays
> constant, I was wondering if this is due to the kFold function I used. Does
> anyone know if the kFold gives you a different split on a data set each time
> you call it?
>
> Thanks,
>
> Jianguo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
and if its a single giant timeseries that is already sorted then Mohit's
solution sounds good to me.

On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak 
wrote:

> But isn't foldLeft() overkill for the originally stated use case of max
> diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative
> non-associative accumulation as opposed to an embarrassingly parallel
> operation such as this one?
>
> This use case reminds me of FIR filtering in DSP. It seems that RDDs could
> use something that serves the same purpose as
> scala.collection.Iterator.sliding.
>
>   --
>  *From:* Koert Kuipers 
> *To:* Mohit Jaggi 
> *Cc:* Tobias Pfeiffer ; "Ganelin, Ilya" <
> ilya.gane...@capitalone.com>; derrickburns ; "
> user@spark.apache.org" 
> *Sent:* Friday, January 30, 2015 7:11 AM
> *Subject:* Re: spark challenge: zip with next???
>
> assuming the data can be partitioned then you have many timeseries for
> which you want to detect potential gaps. also assuming the resulting gaps
> info per timeseries is much smaller data then the timeseries data itself,
> then this is a classical example to me of a sorted (streaming) foldLeft,
> requiring an efficient secondary sort in the spark shuffle. i am trying to
> get that into spark here:
> https://issues.apache.org/jira/browse/SPARK-3655
>
>
>
> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi 
> wrote:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>
> you can use the MLLib function or do the following (which is what I had
> done):
>
> - in first pass over the data, using mapPartitionWithIndex, gather the
> first item in each partition. you can use collect (or aggregator) for this.
> “key” them by the partition index. at the end, you will have a map
>(partition index) --> first item
> - in the second pass over the data, using mapPartitionWithIndex again,
> look at two (or in the general case N items at a time, you can use scala’s
> sliding iterator) items at a time and check the time difference(or any
> sliding window computation). To this mapParitition, pass the map created in
> previous step. You will need to use them to check the last item in this
> partition.
>
> If you can tolerate a few inaccuracies then you can just do the second
> step. You will miss the “boundaries” of the partitions but it might be
> acceptable for your use case.
>
>
>
> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer  wrote:
>
> Hi,
>
> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <
> ilya.gane...@capitalone.com> wrote:
>
>  Make a copy of your RDD with an extra entry in the beginning to offset.
> The you can zip the two RDDs and run a map to generate an RDD of
> differences.
>
>
> Does that work? I recently tried something to compute differences between
> each entry and the next, so I did
>   val rdd1 = ... // null element + rdd
>   val rdd2 = ... // rdd + null element
> but got an error message about zip requiring data sizes in each partition
> to match.
>
> Tobias
>
>
>
>
>
>


Spark SQL - Unable to use Hive UDF because of ClassNotFoundException

2015-01-30 Thread Capitão
I've been trying to run HiveQL queries with UDFs in Spark SQL, but with no
success. The problem occurs only when using functions like
from_unixtime (represented by the Hive class UDFFromUnixTime).

I'm using Spark 1.2 with CDH5.3.0. Running the queries in local mode works,
but in Yarn mode it doesn't. I'm creating an uber-jar with all the needed
dependencies, excluding the ones provided by the cluster (Spark, Hadoop) and
including the Hive ones. When I run the queries in Yarn I get the following
exception:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure:
Lost task 1.3 in stage 0.0 (TID 20, ):
java.lang.NoClassDefFoundError: Lorg/apache/hadoop/hive/ql/exec/UDF;
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2499)
at java.lang.Class.getDeclaredField(Class.java:1951)
at
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1659)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:480)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
scala.collection.immutable.$colon$colon.readObject(List.scala:362)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at
java.io.ObjectInputStream.read

Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
yeah i meant foldLeft by key, sorted by date.
it is non-commutative because i care about the order of processing the
values (chronological). i dont see how i can do it with a reduce
efficiently, but i would be curious to hear otherwise. i might be biased
since this is such a typical operation in map-reduce.

so basically assuming its logs of servers being RDD[(String, Long)] where
String is the server name and Long is the timestamp, you keep a state that
contains the last observed timestamp (if any) and the list of found gaps.
so state type would be (Option[Long], List[Long]). as you process items in
the timeseries chronologically you always update the last observed
timestamp and possible add to the list of found gaps.

foldLeftByKey on RDD[(K, V)] looks something like this:
def foldLeftByKey(state: X)(update: (X, V) => X)(implicit ord:
Ordering[V]): RDD[(K, X)]

and the logic would be (just made this up, didnt test or compile):

rdd.foldLeftByKey((None: Option[Long], List.empty[Long])) {
  case ((Some(prev), gaps), curr) if (curr - prev > thres) => (Some(curr),
curr :: gaps) // gap found
  case ((_, gaps), curr) => (Some(curr), gaps) // no gap found
}

the sort required within timeseries would be done efficiently by spark in
the shuffle (assuming sort-based shuffle is enabled). the foldLeftByKey
would never require the entire timeseries per key to be in memory. however
every timeseries would be processed by a single task, so it might take a
while if the timeseries is very large.
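
Until something like foldLeftByKey exists, one rough, untested way to approximate it
today (assuming the sort-based shuffle and, I believe, Spark 1.2's
repartitionAndSortWithinPartitions; the names ServerPartitioner and gapsPerServer are
made up for illustration) is to sort on a composite (server, timestamp) key while
partitioning on the server only, then stream over each partition:

import org.apache.spark.SparkContext._   // pair/ordered RDD functions on 1.2
import org.apache.spark.rdd.RDD
import org.apache.spark.{HashPartitioner, Partitioner}

// partition only on the server name so one server's events stay in one partition
class ServerPartitioner(n: Int) extends Partitioner {
  private val hash = new HashPartitioner(n)
  def numPartitions = n
  def getPartition(key: Any) = key match { case (server: String, _) => hash.getPartition(server) }
}

def gapsPerServer(logs: RDD[(String, Long)], thres: Long, n: Int): RDD[(String, List[Long])] =
  logs.map { case (server, ts) => ((server, ts), ()) }
    .repartitionAndSortWithinPartitions(new ServerPartitioner(n)) // sorted by (server, ts)
    .mapPartitions { iter =>
      // stream through the partition keeping only (current server, last ts, gaps so far)
      var current: Option[String] = None
      var prev: Option[Long] = None
      var gaps = List.empty[Long]
      val out = scala.collection.mutable.ArrayBuffer.empty[(String, List[Long])]
      for (((server, ts), _) <- iter) {
        if (current == Some(server)) {
          if (prev.exists(p => ts - p > thres)) gaps = ts :: gaps
        } else {
          current.foreach(s => out += ((s, gaps)))
          gaps = List.empty
        }
        current = Some(server)
        prev = Some(ts)
      }
      current.foreach(s => out += ((s, gaps)))
      out.iterator
    }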

On Fri, Jan 30, 2015 at 11:05 AM, Michael Malak 
wrote:

> But isn't foldLeft() overkill for the originally stated use case of max
> diff of adjacent pairs? Isn't foldLeft() for recursive non-commutative
> non-associative accumulation as opposed to an embarrassingly parallel
> operation such as this one?
>
> This use case reminds me of FIR filtering in DSP. It seems that RDDs could
> use something that serves the same purpose as
> scala.collection.Iterator.sliding.
>
>   --
>  *From:* Koert Kuipers 
> *To:* Mohit Jaggi 
> *Cc:* Tobias Pfeiffer ; "Ganelin, Ilya" <
> ilya.gane...@capitalone.com>; derrickburns ; "
> user@spark.apache.org" 
> *Sent:* Friday, January 30, 2015 7:11 AM
> *Subject:* Re: spark challenge: zip with next???
>
> assuming the data can be partitioned then you have many timeseries for
> which you want to detect potential gaps. also assuming the resulting gaps
> info per timeseries is much smaller data then the timeseries data itself,
> then this is a classical example to me of a sorted (streaming) foldLeft,
> requiring an efficient secondary sort in the spark shuffle. i am trying to
> get that into spark here:
> https://issues.apache.org/jira/browse/SPARK-3655
>
>
>
> On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi 
> wrote:
>
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>
> you can use the MLLib function or do the following (which is what I had
> done):
>
> - in first pass over the data, using mapPartitionWithIndex, gather the
> first item in each partition. you can use collect (or aggregator) for this.
> “key” them by the partition index. at the end, you will have a map
>(partition index) --> first item
> - in the second pass over the data, using mapPartitionWithIndex again,
> look at two (or in the general case N items at a time, you can use scala’s
> sliding iterator) items at a time and check the time difference(or any
> sliding window computation). To this mapParitition, pass the map created in
> previous step. You will need to use them to check the last item in this
> partition.
>
> If you can tolerate a few inaccuracies then you can just do the second
> step. You will miss the “boundaries” of the partitions but it might be
> acceptable for your use case.
>
>
>
> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer  wrote:
>
> Hi,
>
> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <
> ilya.gane...@capitalone.com> wrote:
>
>  Make a copy of your RDD with an extra entry in the beginning to offset.
> The you can zip the two RDDs and run a map to generate an RDD of
> differences.
>
>
> Does that work? I recently tried something to compute differences between
> each entry and the next, so I did
>   val rdd1 = ... // null element + rdd
>   val rdd2 = ... // rdd + null element
> but got an error message about zip requiring data sizes in each partition
> to match.
>
> Tobias
>
>
>
>
>
>


Re: spark-shell working in scala-2.11 (breaking change?)

2015-01-30 Thread Stephen Haberman
Hi Krishna/all,

I think I found it, and it wasn't related to Scala-2.11...

I had "spark.eventLog.dir=/mnt/spark/work/history", which worked
in Spark 1.2, but now am running Spark master, and it wants a
Hadoop URI, e.g. file:///mnt/spark/work/history (I believe due to
commit 45645191).

This looks like a breaking change to the spark.eventLog.dir
config property.
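
For reference, the change involved, shown programmatically with the values from above
(a sketch; the same applies to the corresponding entry in spark-defaults.conf):

import org.apache.spark.SparkConf

val oldStyle = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/mnt/spark/work/history")         // worked on Spark 1.2

val newStyle = new SparkConf()
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///mnt/spark/work/history")  // Hadoop URI, required on master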

Perhaps it should be patched to convert the previously supported
"just a file path" values to HDFS-compatible "file://..." URIs
for backwards compatibility?

- Stephen


On Wed, 28 Jan 2015 12:27:17 -0800
Krishna Sankar  wrote:

> Stephen,
>Scala 2.11 worked fine for me. Did the dev change and then
> compile. Not using in production, but I go back and forth
> between 2.10 & 2.11. Cheers
> 
> 
> On Wed, Jan 28, 2015 at 12:18 PM, Stephen Haberman <
> stephen.haber...@gmail.com> wrote:
> 
> > Hey,
> >
> > I recently compiled Spark master against scala-2.11 (by
> > running the dev/change-versions script), but when I run
> > spark-shell, it looks like the "sc" variable is missing.
> >
> > Is this a known/unknown issue? Are others successfully using
> > Spark with scala-2.11, and specifically spark-shell?
> >
> > It is possible I did something dumb while compiling master,
> > but I'm not sure what it would be.
> >
> > Thanks,
> > Stephen
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
> >


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error when running spark in debug mode

2015-01-30 Thread Ankur Srivastava
Hi Arush

I have configured log4j by updating the file log4j.properties in
SPARK_HOME/conf folder.

If it were a log4j defect, we would get errors in debug mode in all apps.

Thanks
Ankur
 Hi Ankur,

How are you enabling the debug level of logs. It should be a log4j
configuration. Even if there would be some issue it would be in log4j and
not in spark.

Thanks
Arush

On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi,
>
> When ever I enable DEBUG level logs for my spark cluster, on running a job
> all the executors die with the below exception. On disabling the DEBUG logs
> my jobs move to the next step.
>
>
> I am on spark-1.1.0
>
> Is this a known issue with spark?
>
> Thanks
> Ankur
>
> 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
> SecurityManager: authentication disabled; ui acls disabled; users with view
> permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
>
> 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In
> createActorSystem, requireCookie is: off
>
> 2015-01-29 22:27:42,871
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
> akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>
> 2015-01-29 22:27:42,912
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Starting remoting
>
> 2015-01-29 22:27:43,057
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Remoting started; listening on addresses :[akka.tcp://
> driverPropsFetcher@10.77.9.155:36035]
>
> 2015-01-29 22:27:43,060
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Remoting now listens on addresses: [akka.tcp://
> driverPropsFetcher@10.77.9.155:36035]
>
> 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
> Successfully started service 'driverPropsFetcher' on port 36035.
>
> 2015-01-29 22:28:13,077 [main] ERROR
> org.apache.hadoop.security.UserGroupInformation -
> PriviledgedActionException as:ubuntu
> cause:java.util.concurrent.TimeoutException: Futures timed out after [30
> seconds]
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
> Unknown exception in doAs
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>
> Caused by: java.security.PrivilegedActionException:
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>
> ... 4 more
>
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [30 seconds]
>
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>
> at scala.concurrent.Await$.result(package.scala:107)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
>
> ... 7 more
>



-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Ayoub
I am not personally aware of a repo for snapshot builds.
In my use case, I had to build spark 1.2.1-snapshot

see https://spark.apache.org/docs/latest/building-spark.html

2015-01-30 17:11 GMT+01:00 Debajyoti Roy :

> Thanks Ayoub and Zhan,
> I am new to Spark and wanted to make sure I am not trying something stupid
> or using a wrong API.
>
> Is there a repo where I can pull the snapshot or nightly builds for Spark?
>
>
>
> On Fri, Jan 30, 2015 at 2:45 AM, Ayoub Benali  > wrote:
>
>> Hello,
>>
>> I had the same issue, then I found this JIRA ticket
>> https://issues.apache.org/jira/browse/SPARK-4825
>> So I switched to Spark 1.2.1-snapshot, which solved the problem.
>>
>>
>>
>> 2015-01-30 8:40 GMT+01:00 Zhan Zhang :
>>
>>>  I think it is expected. Refer to the comments in saveAsTable "Note that
>>> this currently only works with SchemaRDDs that are created from a
>>> HiveContext". If I understand correctly, here the SchemaRDD means those
>>> generated by HiveContext.sql, instead of applySchema.
>>>
>>>  Thanks.
>>>
>>>  Zhan Zhang
>>>
>>>
>>>
>>>  On Jan 29, 2015, at 9:38 PM, matroyd 
>>> wrote:
>>>
>>> Hi, I am trying saveAsTable on SchemaRDD created from HiveContext and it
>>> fails. This is on Spark 1.2.0. Following are details of the code, command
>>> and exceptions:
>>> http://stackoverflow.com/questions/28222496/how-to-enable-sql-on-schemardd-via-the-jdbc-interface-is-it-even-possible
>>> 
>>> Thanks in advance for any guidance
>>> --
>>> View this message in context: HiveContext created SchemaRDD's
>>> saveAsTable is not working on 1.2.0
>>> 
>>> Sent from the Apache Spark User List mailing list archive
>>> 
>>> at Nabble.com.
>>>
>>>
>>>
>>
>
>
> --
> Thanks,
>
> *Debajyoti Roy*
> debajyoti@healthagen.com
> (646)561-0844 <646-561-0844>
> 350 Madison Ave., FL 16,
> New York, NY 10017.
>




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Re-HiveContext-created-SchemaRDD-s-saveAsTable-is-not-working-on-1-2-0-tp21442.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: HW imbalance

2015-01-30 Thread Sandy Ryza
Yup, if you turn off YARN's CPU scheduling then you can run executors to
take advantage of the extra memory on the larger boxes. But then some of
the nodes will end up severely oversubscribed from a CPU perspective, so I
would definitely recommend against that.



On Fri, Jan 30, 2015 at 3:31 AM, Michael Segel 
wrote:

> Sorry, but I think there’s a disconnect.
>
> When you launch a job under YARN on any of the hadoop clusters, the number
> of mappers/reducers is not set and is dependent on the amount of available
> resources.
> So under Ambari, CM, or MapR’s Admin, you should be able to specify the
> amount of resources available on any node which is to be allocated to
> YARN’s RM.
> So if your node has 32GB allocated, you can run N jobs concurrently based
> on the amount of resources you request when you submit your application.
>
> If you have 64GB allocated, you can run up to 2N jobs concurrently based
> on the same memory constraints.
>
> In terms of job scheduling, where and when a job can run is going to be
> based on available resources.  So if you want to run a job that needs 16GB
> of resources, and all of your nodes are busy and only have 4GB per node
> available to YARN, your 16GB job will wait until there is at least that
> much resources available.
>
> To your point, if you say you need 4GB per task, then it must be the same
> per task for that job. The larger the cluster node, in this case memory,
> the more jobs you can run.
>
> This is of course assuming you could over subscribe a node in terms of cpu
> cores if you have memory available.
>
> YMMV
>
> HTH
> -Mike
>
> On Jan 30, 2015, at 7:10 AM, Sandy Ryza  wrote:
>
> My answer was based off the specs that Antony mentioned: different amounts
> of memory, but 10 cores on all the boxes.  In that case, a single Spark
> application's homogeneously sized executors won't be able to take advantage
> of the extra memory on the bigger boxes.
>
> Cloudera Manager can certainly configure YARN with different resource
> profiles for different nodes if that's what you're wondering.
>
> -Sandy
>
> On Thu, Jan 29, 2015 at 11:03 PM, Michael Segel  > wrote:
>
>> @Sandy,
>>
>> There are two issues.
>> The spark context (executor) and then the cluster under YARN.
>>
>> If you have a box where each yarn job needs 3GB,  and your machine has
>> 36GB dedicated as a YARN resource, you can run 12 executors on the single
>> node.
>> If you have a box that has 72GB dedicated to YARN, you can run up to 24
>> contexts (executors) in parallel.
>>
>> Assuming that you’re not running any other jobs.
>>
>> The larger issue is if your version of Hadoop will easily let you run
>> with multiple profiles or not. Ambari (1.6 and early does not.) Its
>> supposed to be fixed in 1.7 but I haven’t evaluated it yet.
>> Cloudera? YMMV
>>
>> If I understood the question raised by the OP, its more about a
>> heterogeneous cluster than spark.
>>
>> -Mike
>>
>> On Jan 26, 2015, at 5:02 PM, Sandy Ryza  wrote:
>>
>> Hi Antony,
>>
>> Unfortunately, all executors for any single Spark application must have
>> the same amount of memory.  It's possible to configure YARN with different
>> amounts of memory for each host (using
>> yarn.nodemanager.resource.memory-mb), so other apps might be able to take
>> advantage of the extra memory.
>>
>> -Sandy
>>
>> On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel > > wrote:
>>
>>> If you’re running YARN, then you should be able to mix and max where
>>> YARN is managing the resources available on the node.
>>>
>>> Having said that… it depends on which version of Hadoop/YARN.
>>>
>>> If you’re running Hortonworks and Ambari, then setting up multiple
>>> profiles may not be straight forward. (I haven’t seen the latest version of
>>> Ambari)
>>>
>>> So in theory, one profile would be for your smaller 36GB of ram, then
>>> one profile for your 128GB sized machines.
>>> Then as your request resources for your spark job, it should schedule
>>> the jobs based on the cluster’s available resources.
>>> (At least in theory.  I haven’t tried this so YMMV)
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Jan 26, 2015, at 4:25 PM, Antony Mayi 
>>> wrote:
>>>
>>> should have said I am running as yarn-client. all I can see is
>>> specifying the generic executor memory that is then to be used in all
>>> containers.
>>>
>>>
>>>   On Monday, 26 January 2015, 16:48, Charles Feduke <
>>> charles.fed...@gmail.com> wrote:
>>>
>>>
>>>
>>> You should look at using Mesos. This should abstract away the individual
>>> hosts into a pool of resources and make the different physical
>>> specifications manageable.
>>>
>>> I haven't tried configuring Spark Standalone mode to have different
>>> specs on different machines but based on spark-env.sh.template:
>>>
>>> # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
>>> # - SPARK_WORKER_MEMORY, to set how much total memory workers have to
>>> give executors (e.g. 1000m, 2g)
>>> # - SPARK_WORKER_OPTS, to set config properties only f

Re: Negative Accumulators

2015-01-30 Thread francois . garillot
Sanity-check: would it be possible that `threshold_var` be negative ?



—
FG

On Fri, Jan 30, 2015 at 5:06 PM, Peter Thai  wrote:

> Hello,
> I am seeing negative values for accumulators. Here's my implementation in a
> standalone app in Spark 1.1.1rc:
>   implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
> def addInPlace(t1: Int, t2: BigInt) = BigInt(t1) + t2
> def addInPlace(t1: BigInt, t2: BigInt) = t1 + t2
> def zero(initialValue: BigInt) = BigInt(0)
>   }
> val capped_numpings_accu = sc.accumulator(BigInt(0))(BigIntAccumulatorParam)
> myRDD.foreach(x=>{ capped_numpings_accu+=BigInt(x._1).min(threshold_var)})
> When I remove the min() condition, I no longer see negative values.
> Thanks!
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Negative-Accumulators-tp21441.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

Does the kFold in Spark always give you the same split?

2015-01-30 Thread Jianguo Li
Hi,

I am using the utility function kFold provided in Spark for doing k-fold
cross validation using logistic regression. However, each time I run the
experiment, I got different different result. Since everything else stays
constant, I was wondering if this is due to the kFold function I used. Does
anyone know if the kFold gives you a different split on a data set each
time you call it?

Thanks,

Jianguo


Re: spark challenge: zip with next???

2015-01-30 Thread Michael Malak
But isn't foldLeft() overkill for the originally stated use case of max diff of 
adjacent pairs? Isn't foldLeft() for recursive non-commutative non-associative 
accumulation as opposed to an embarrassingly parallel operation such as this 
one?
This use case reminds me of FIR filtering in DSP. It seems that RDDs could use 
something that serves the same purpose as scala.collection.Iterator.sliding.
  From: Koert Kuipers 
 To: Mohit Jaggi  
Cc: Tobias Pfeiffer ; "Ganelin, Ilya" 
; derrickburns ; 
"user@spark.apache.org"  
 Sent: Friday, January 30, 2015 7:11 AM
 Subject: Re: spark challenge: zip with next???
   
assuming the data can be partitioned then you have many timeseries for which 
you want to detect potential gaps. also assuming the resulting gaps info per 
timeseries is much smaller data then the timeseries data itself, then this is a 
classical example to me of a sorted (streaming) foldLeft, requiring an 
efficient secondary sort in the spark shuffle. i am trying to get that into 
spark here:
https://issues.apache.org/jira/browse/SPARK-3655



On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi  wrote:

http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
you can use the MLLib function or do the following (which is what I had done):
- in first pass over the data, using mapPartitionWithIndex, gather the first
item in each partition. you can use collect (or aggregator) for this. “key”
them by the partition index. at the end, you will have a map
   (partition index) --> first item
- in the second pass over the data, using mapPartitionWithIndex again, look at
two (or in the general case N items at a time, you can use scala’s sliding
iterator) items at a time and check the time difference (or any sliding window
computation). To this mapPartition, pass the map created in the previous step.
You will need to use them to check the last item in this partition.
If you can tolerate a few inaccuracies then you can just do the second step. 
You will miss the “boundaries” of the partitions but it might be acceptable for 
your use case.



On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer  wrote:
Hi,

On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya  
wrote:

Make a copy of your RDD with an extra entry in the beginning to offset. The you 
can zip the two RDDs and run a map to generate an RDD of differences.


Does that work? I recently tried something to compute differences between each 
entry and the next, so I did
  val rdd1 = ... // null element + rdd
  val rdd2 = ... // rdd + null element
but got an error message about zip requiring data 
sizes in each partition to match.
Tobias






  

Negative Accumulators

2015-01-30 Thread Peter Thai
Hello,

I am seeing negative values for accumulators. Here's my implementation in a
standalone app in Spark 1.1.1rc:

  implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
def addInPlace(t1: Int, t2: BigInt) = BigInt(t1) + t2
def addInPlace(t1: BigInt, t2: BigInt) = t1 + t2
def zero(initialValue: BigInt) = BigInt(0)
  }

val capped_numpings_accu = sc.accumulator(BigInt(0))(BigIntAccumulatorParam)
myRDD.foreach(x=>{ capped_numpings_accu+=BigInt(x._1).min(threshold_var)})

When I remove the min() condition, I no longer see negative values.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Negative-Accumulators-tp21441.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Define size partitions

2015-01-30 Thread Guillermo Ortiz
Hi,

I want to process some files; they're kind of big, dozens of
gigabytes each. I get them as an array of bytes and there's a
structure inside them.

I have a header which describes the structure. It could be like:
Number(8bytes) Char(16bytes) Number(4 bytes) Char(1bytes), ..
This structure appears N times on the file.

So I know the size of each block since it's fixed. There's no
separator between blocks.

If I were doing this with MapReduce, I could implement a new
RecordReader and InputFormat to read each block, because I know the
size of the blocks and I'd fix the split size in the driver (blockX1000 for
example). That way, I'd know that each split for each mapper
holds complete blocks and there isn't a piece of the last block in the
next split.

Spark works with RDDs and partitions. How could I resize each
partition to do that? Is it possible? I guess that Spark doesn't use
the RecordReader and these classes for these tasks.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
I am programmatically submitting Spark jobs in yarn-client mode on EMR.
Whenever a job tries to save a file to S3, it gives the exception below.
I think the issue might be that EMR is not set up properly, as I
have to set all Hadoop configurations manually in SparkContext. However, I
am not sure which configuration I am missing (if any).

Configurations that I am using in SparkContext to setup EMRFS:
"spark.hadoop.fs.s3n.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
"spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
"spark.hadoop.fs.emr.configuration.version": "1.0",
"spark.hadoop.fs.s3n.multipart.uploads.enabled": "true",
"spark.hadoop.fs.s3.enableServerSideEncryption": "false",
"spark.hadoop.fs.s3.serverSideEncryptionAlgorithm": "AES256",
"spark.hadoop.fs.s3.consistent": "true",
"spark.hadoop.fs.s3.consistent.retryPolicyType": "exponential",
"spark.hadoop.fs.s3.consistent.retryPeriodSeconds": "10",
"spark.hadoop.fs.s3.consistent.retryCount": "5",
"spark.hadoop.fs.s3.maxRetries": "4",
"spark.hadoop.fs.s3.sleepTimeSeconds": "10",
"spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency": "true",
"spark.hadoop.fs.s3.consistent.metadata.autoCreate": "true",
"spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
"spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500",
"spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100",
"spark.hadoop.fs.s3.consistent.fastList": "true",
"spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata": "false",
"spark.hadoop.fs.s3.consistent.notification.CloudWatch": "false",
"spark.hadoop.fs.s3.consistent.notification.SQS": "false",

Exception:
java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1006)
at java.io.File.createTempFile(File.java:1989)
at
com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
at
com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
at
com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
at
com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
at
org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
at
org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Hints? Suggestions?
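
(For what it's worth, the way these settings are expected to reach the executors: any
"spark.hadoop."-prefixed entry on the SparkConf should be copied, with the prefix
stripped, into the Hadoop Configuration that Spark builds, which allows a quick sanity
check from the driver. A minimal sketch with only the first setting shown:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("emrfs-config-check")
  .set("spark.hadoop.fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
val sc = new SparkContext(conf)
// expected: com.amazon.ws.emr.hadoop.fs.EmrFileSystem
println(sc.hadoopConfiguration.get("fs.s3.impl"))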


[Graphx & Spark] Error of "Lost executor" and TimeoutException

2015-01-30 Thread Yifan LI
Hi,

I am running my GraphX application on Spark 1.2.0 (an 11-node cluster), having
requested 30GB of memory per node and 100 cores for an input dataset of around
1GB (a 5 million vertex graph).

But the error below always happens…

Is there anyone who could give me some pointers?

(BTW, the overall edge/vertex RDDs will reach more than 100GB during graph
computation, and another version of my application works well on the same
dataset while needing much less memory during computation.)

Thanks in advance!!!


15/01/29 18:05:08 ERROR ContextCleaner: Error cleaning broadcast 60
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:137)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:227)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)
at 
org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:185)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:147)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:138)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:138)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
at 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply(ContextCleaner.scala:134)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1460)
at 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:133)
at org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:65)
[Stage 91:===>  (2 + 4) / 
6]15/01/29 18:08:15 ERROR SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 0
[Stage 93:>  (29 + 20) / 
49]15/01/29 23:47:03 ERROR TaskSchedulerImpl: Lost executor 9 on 
small11-tap1.common.lip6.fr: remote Akka client disassociated
[Stage 83:>   (1 + 0) / 6][Stage 86:>   (0 + 1) / 2][Stage 88:>   (0 + 2) / 
8]15/01/29 23:47:06 ERROR SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 9
[Stage 83:===>  (5 + 1) / 6][Stage 88:=>   (9 + 2) / 
11]15/01/29 23:57:30 ERROR TaskSchedulerImpl: Lost executor 8 on 
small10-tap1.common.lip6.fr: remote Akka client disassociated
15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 8
15/01/29 23:57:30 ERROR SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 8

Best,
Yifan LI







Re: spark challenge: zip with next???

2015-01-30 Thread Koert Kuipers
assuming the data can be partitioned then you have many timeseries for
which you want to detect potential gaps. also assuming the resulting gaps
info per timeseries is much smaller data then the timeseries data itself,
then this is a classical example to me of a sorted (streaming) foldLeft,
requiring an efficient secondary sort in the spark shuffle. i am trying to
get that into spark here:
https://issues.apache.org/jira/browse/SPARK-3655

On Fri, Jan 30, 2015 at 12:27 AM, Mohit Jaggi  wrote:

>
> http://mail-archives.apache.org/mod_mbox/spark-user/201405.mbox/%3ccalrvtpkn65rolzbetc+ddk4o+yjm+tfaf5dz8eucpl-2yhy...@mail.gmail.com%3E
>
> you can use the MLLib function or do the following (which is what I had
> done):
>
> - in first pass over the data, using mapPartitionWithIndex, gather the
> first item in each partition. you can use collect (or aggregator) for this.
> “key” them by the partition index. at the end, you will have a map
>(partition index) --> first item
> - in the second pass over the data, using mapPartitionWithIndex again,
> look at two (or in the general case N items at a time, you can use scala’s
> sliding iterator) items at a time and check the time difference(or any
> sliding window computation). To this mapParitition, pass the map created in
> previous step. You will need to use them to check the last item in this
> partition.
>
> If you can tolerate a few inaccuracies then you can just do the second
> step. You will miss the “boundaries” of the partitions but it might be
> acceptable for your use case.
>
>
>
> On Jan 29, 2015, at 4:36 PM, Tobias Pfeiffer  wrote:
>
> Hi,
>
> On Fri, Jan 30, 2015 at 6:32 AM, Ganelin, Ilya <
> ilya.gane...@capitalone.com> wrote:
>
>>  Make a copy of your RDD with an extra entry in the beginning to offset.
>> The you can zip the two RDDs and run a map to generate an RDD of
>> differences.
>>
>
> Does that work? I recently tried something to compute differences between
> each entry and the next, so I did
>   val rdd1 = ... // null element + rdd
>   val rdd2 = ... // rdd + null element
> but got an error message about zip requiring data sizes in each partition
> to match.
>
> Tobias
>
>
>
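
A rough, untested sketch of the two-pass approach Mohit describes above, for the
simple case of a single, already globally sorted timeseries of timestamps (the names
adjacentDiffs and heads are made up for illustration):

import org.apache.spark.rdd.RDD

def adjacentDiffs(ts: RDD[Long]): RDD[Long] = {
  // pass 1: the first element of every partition, keyed by partition index
  val heads: Map[Int, Long] = ts
    .mapPartitionsWithIndex { (idx, it) =>
      if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
    }
    .collect().toMap

  // pass 2: slide a window of 2 over each partition, closing the boundary with the
  // head of the next partition (if any)
  ts.mapPartitionsWithIndex { (idx, it) =>
    val boundary = heads.get(idx + 1).iterator
    (it ++ boundary).sliding(2).collect { case Seq(a, b) => b - a }
  }
}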


Re: Building Spark behind a proxy

2015-01-30 Thread Arush Kharbanda
Hi Somya,

I meant when you configure the JAVA_OPTS and when you don't configure the
JAVA_OPTS is there any difference in the error message?

Are you facing the same issue when you built using maven?

Thanks
Arush

On Thu, Jan 29, 2015 at 10:22 PM, Soumya Simanta 
wrote:

> I can do a
> wget
> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
> and get the file successfully on a shell.
>
>
>
> On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas 
> wrote:
>
>> At least a part of it is due to connection refused, can you check if curl
>> can reach the URL with proxies -
>> [FATAL] Non-resolvable parent POM: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (
>> http://repo.maven.apache.org/maven2): Error transferring file:
>> Connection refused from
>> http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
>>
>> On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta <
>> soumya.sima...@gmail.com> wrote:
>>
>>>
>>>
>>> On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda <
>>> ar...@sigmoidanalytics.com> wrote:
>>>
 Does  the error change on build with and without the built options?

>>> What do you mean by build options? I'm just doing ./sbt/sbt assembly
>>> from $SPARK_HOME
>>>
>>>
 Did you try using maven? and doing the proxy settings there.

>>>
>>>  No I've not tried maven yet. However, I did set proxy settings inside
>>> my .m2/setting.xml, but it didn't make any difference.
>>>
>>>
>>>
>>
>


-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error when running spark in debug mode

2015-01-30 Thread Arush Kharbanda
Hi Ankur,

How are you enabling the debug level of logs. It should be a log4j
configuration. Even if there would be some issue it would be in log4j and
not in spark.

Thanks
Arush

On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava <
ankur.srivast...@gmail.com> wrote:

> Hi,
>
> When ever I enable DEBUG level logs for my spark cluster, on running a job
> all the executors die with the below exception. On disabling the DEBUG logs
> my jobs move to the next step.
>
>
> I am on spark-1.1.0
>
> Is this a known issue with spark?
>
> Thanks
> Ankur
>
> 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
> SecurityManager: authentication disabled; ui acls disabled; users with view
> permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
>
> 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In
> createActorSystem, requireCookie is: off
>
> 2015-01-29 22:27:42,871
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
> akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>
> 2015-01-29 22:27:42,912
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Starting remoting
>
> 2015-01-29 22:27:43,057
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Remoting started; listening on addresses :[akka.tcp://
> driverPropsFetcher@10.77.9.155:36035]
>
> 2015-01-29 22:27:43,060
> [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
> Remoting now listens on addresses: [akka.tcp://
> driverPropsFetcher@10.77.9.155:36035]
>
> 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
> Successfully started service 'driverPropsFetcher' on port 36035.
>
> 2015-01-29 22:28:13,077 [main] ERROR
> org.apache.hadoop.security.UserGroupInformation -
> PriviledgedActionException as:ubuntu
> cause:java.util.concurrent.TimeoutException: Futures timed out after [30
> seconds]
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
> Unknown exception in doAs
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>
> Caused by: java.security.PrivilegedActionException:
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>
> ... 4 more
>
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [30 seconds]
>
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>
> at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>
> at scala.concurrent.Await$.result(package.scala:107)
>
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
>
> ... 7 more
>



-- 

[image: Sigmoid Analytics] 

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: WARN NativeCodeLoader warning in spark shell

2015-01-30 Thread Sean Owen
This is ignorable, and a message from Hadoop, which basically means
what it says. It's almost infamous; search Google. You don't have to
do anything.

On Fri, Jan 30, 2015 at 1:04 PM, kundan kumar  wrote:
> Hi,
>
> Whenever I start spark shell I get this warning.
>
> WARN NativeCodeLoader: Unable to load native-hadoop library for your
> platform... using builtin-java classes where applicable
>
> Whats the meaning of this and does/how can it impact the execution of my
> spark jobs ?
>
> Please suggest how can I fix this ?
>
>
> Thanks !!
> Kundan

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



WARN NativeCodeLoader warning in spark shell

2015-01-30 Thread kundan kumar
Hi,

Whenever I start spark shell I get this warning.

WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable

Whats the meaning of this and does/how can it impact the execution of my
spark jobs ?

Please suggest how can I fix this ?


Thanks !!
Kundan


HBase Thrift API Error on map/reduce functions

2015-01-30 Thread mtheofilos
I get a serialization problem trying to run

Python:
sc.parallelize(['1','2']).map(lambda id: client.getRow('table', id, None))

cloudpickle.py can't pickle method_descriptor type
I add a function to pickle a method descriptor and now it exceeds the
recursion limit
I print the method name before i pickle it and it is "reset" from
cStringIO.StringO (output)
The problem was at line ~830 of cloudpickle, trying to pickle a file
And the initial object to pickle was that:
(, None, PairDeserializer(UTF8Deserializer(),
UTF8Deserializer()), BatchedSerializer(PickleSerializer(), 0))

And the error is this:
  File "/home/user/inverted-index.py", line 80, in 
print
sc.wholeTextFiles(data_dir).flatMap(update).take(2)#.groupByKey().map(store).take(2)
  File "/home/user/spark2/python/pyspark/rdd.py", line 1081, in take
totalParts = self._jrdd.partitions().size()
  File "/home/user/spark2/python/pyspark/rdd.py", line 2107, in _jrdd
pickled_command = ser.dumps(command)
  File "/home/user/spark2/python/pyspark/serializers.py", line 402, in dumps
return cloudpickle.dumps(obj, 2)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 832, in dumps
cp.dump(obj)
  File "/home/user/spark2/python/pyspark/cloudpickle.py", line 147, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not pickle object as excessively deep recursion
required.
Try _fast_serialization=2 or contact PiCloud support

Can any developer who works on this tell me if that problem can be
fixed?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/HBase-Thrift-API-Error-on-map-reduce-functions-tp21439.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HW imbalance

2015-01-30 Thread Michael Segel
Sorry, but I think there’s a disconnect. 

When you launch a job under YARN on any of the hadoop clusters, the number of 
mappers/reducers is not set and is dependent on the amount of available 
resources. 
So under Ambari, CM, or MapR’s Admin, you should be able to specify the amount 
of resources available on any node which is to be allocated to YARN’s RM. 
So if your node has 32GB allocated, you can run N jobs concurrently based on 
the amount of resources you request when you submit your application. 

If you have 64GB allocated, you can run up to 2N jobs concurrently based on the 
same memory constraints. 

In terms of job scheduling, where and when a job can run is going to be based 
on available resources.  So if you want to run a job that needs 16GB of 
resources, and all of your nodes are busy and only have 4GB per node available 
to YARN, your 16GB job will wait until there is at least that much resources 
available.   

To your point, if you say you need 4GB per task, then it must be the same per 
task for that job. The larger the cluster node, in this case memory, the more 
jobs you can run. 

This is of course assuming you could over subscribe a node in terms of cpu 
cores if you have memory available. 

YMMV

HTH
-Mike

On Jan 30, 2015, at 7:10 AM, Sandy Ryza  wrote:

> My answer was based off the specs that Antony mentioned: different amounts of 
> memory, but 10 cores on all the boxes.  In that case, a single Spark 
> application's homogeneously sized executors won't be able to take advantage 
> of the extra memory on the bigger boxes.
> 
> Cloudera Manager can certainly configure YARN with different resource 
> profiles for different nodes if that's what you're wondering.
> 
> -Sandy
> 
> On Thu, Jan 29, 2015 at 11:03 PM, Michael Segel  
> wrote:
> @Sandy, 
> 
> There are two issues. 
> The spark context (executor) and then the cluster under YARN. 
> 
> If you have a box where each yarn job needs 3GB,  and your machine has 36GB 
> dedicated as a YARN resource, you can run 12 executors on the single node. 
> If you have a box that has 72GB dedicated to YARN, you can run up to 24 
> contexts (executors) in parallel. 
> 
> Assuming that you’re not running any other jobs. 
> 
> The larger issue is if your version of Hadoop will easily let you run with 
> multiple profiles or not. Ambari (1.6 and early does not.) Its supposed to be 
> fixed in 1.7 but I haven’t evaluated it yet. 
> Cloudera? YMMV
> 
> If I understood the question raised by the OP, its more about a heterogeneous 
> cluster than spark.
> 
> -Mike
> 
> On Jan 26, 2015, at 5:02 PM, Sandy Ryza  wrote:
> 
>> Hi Antony,
>> 
>> Unfortunately, all executors for any single Spark application must have the 
>> same amount of memory.  It's possible to configure YARN with different 
>> amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), 
>> so other apps might be able to take advantage of the extra memory.
>> 
>> -Sandy
>> 
>> On Mon, Jan 26, 2015 at 8:34 AM, Michael Segel  
>> wrote:
>> If you’re running YARN, then you should be able to mix and match where YARN is 
>> managing the resources available on the node. 
>> 
>> Having said that… it depends on which version of Hadoop/YARN. 
>> 
>> If you’re running Hortonworks and Ambari, then setting up multiple profiles 
>> may not be straightforward. (I haven’t seen the latest version of Ambari.) 
>> 
>> So in theory, one profile would be for your smaller 36GB RAM machines, and one 
>> profile for your 128GB machines. 
>> Then as you request resources for your Spark job, it should schedule the 
>> jobs based on the cluster’s available resources. 
>> (At least in theory.  I haven’t tried this so YMMV) 
>> 
>> HTH
>> 
>> -Mike
>> 
>> On Jan 26, 2015, at 4:25 PM, Antony Mayi  
>> wrote:
>> 
>>> I should have said I am running as yarn-client. All I can see is specifying 
>>> the generic executor memory that is then to be used in all containers.
>>> 
>>> 
>>> On Monday, 26 January 2015, 16:48, Charles Feduke 
>>>  wrote:
>>> 
>>> 
>>> You should look at using Mesos. This should abstract away the individual 
>>> hosts into a pool of resources and make the different physical 
>>> specifications manageable.
>>> 
>>> I haven't tried configuring Spark Standalone mode to have different specs 
>>> on different machines but based on spark-env.sh.template:
>>> 
>>> # - SPARK_WORKER_CORES, to set the number of cores to use on this machine
>>> # - SPARK_WORKER_MEMORY, to set how much total memory workers have to give 
>>> executors (e.g. 1000m, 2g)
>>> # - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. 
>>> "-Dx=y")
>>> it looks like you should be able to mix. (It's not clear to me whether 
>>> SPARK_WORKER_MEMORY is uniform across the cluster or for the machine where 
>>> the config file resides.)
>>> 
>>> On Mon Jan 26 2015 at 8:07:51 AM Antony Mayi  
>>> wrote:
>>> Hi,
>>> 
>>> is it possible to mix hosts with (significantly) different specs within a 
>>> cluster 

Re: KMeans with large clusters Java Heap Space

2015-01-30 Thread derrickburns
By default, HashingTF turns each document into a sparse vector in R^(2^20),
i.e. a million-dimensional space. The current Spark clusterer turns each
sparse vector into a dense vector with a million entries when it is added to a
cluster.  Hence, the memory needed grows as the number of clusters times 8MB
(2^20 doubles at 8 bytes each); with k = 1,000 clusters that is already on the
order of 8 GB just for the centers.

You should try to use my new generalized kmeans clustering package, which
works on high-dimensional sparse data.

You will want to use the RandomIndexing embedding:

def sparseTrain(raw: RDD[Vector], k: Int): KMeansModel = {
  // embed the sparse input in a low-dimensional space via random indexing
  KMeans.train(raw, k, embeddingNames = List(LOW_DIMENSIONAL_RI))
}
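
A hedged usage sketch: HashingTF is the standard MLlib class; KMeans,
KMeansModel and LOW_DIMENSIONAL_RI come from the linked package, whose import
paths are omitted here, and docs is an assumed RDD[Seq[String]] of tokenized
documents:

import org.apache.spark.mllib.feature.HashingTF

val tf = new HashingTF()            // defaults to 2^20 features
val vectors = tf.transform(docs)    // RDD[Vector], kept sparse
val model = sparseTrain(vectors, k = 100)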



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-tp21432p21437.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Read from file and broadcast before every Spark Streaming bucket?

2015-01-30 Thread Sean Owen
You should say what errors you see. But I assume it is because you are trying
to create broadcast variables on the executors. Why? It sounds like you already
have the data you want everywhere, ready to read locally.
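
One common pattern avoids broadcast entirely: re-read the small csv on the
driver inside transform() for every batch, and let the tiny Set of ad names
ride along in the task closure. This is a sketch only; the stream name, the
(ad, count) pair type and the file path are assumptions:

import scala.io.Source

val filtered = adCounts.transform { rdd =>
  // this closure runs on the driver at each batch interval
  val src = Source.fromFile("/home/etl/ads.csv")  // hypothetical path on the edge node
  val ads = try src.getLines().map(_.trim).toSet finally src.close()
  // the small Set is serialized with the task closure, no broadcast needed
  rdd.filter { case (ad, _) => ads.contains(ad) }
}
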
On Jan 30, 2015 4:06 AM, "YaoPau"  wrote:

> I'm creating a real-time visualization of counts of ads shown on my
> website,
> using that data pushed through by Spark Streaming.
>
> To avoid clutter, it only looks good to show 4 or 5 lines on my
> visualization at once (corresponding to 4 or 5 different ads), but there
> are
> 50+ different ads that show on my site.
>
> What I'd like to do is quickly change which ads to pump through Spark
> Streaming, without having to rebuild the .jar and push it to my edge node.
> Ideally I'd have a .csv file on my edge node with a list of 4 ad names, and
> every time a StreamRDD is created it reads from that tiny file, creates a
> broadcast variable, and uses that variable as a filter.  That way I could
> just open up the .csv file, save it, and the stream filters correctly
> automatically.
>
> I keep getting errors when I try this.  Has anyone had success with a
> broadcast variable that updates with each new streamRDD?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Read-from-file-and-broadcast-before-every-Spark-Streaming-bucket-tp21433.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: unknown issue in submitting a spark job

2015-01-30 Thread Sean Owen
You should not disable the GC overhead limit. How does increasing executor
total memory cause you to not have enough memory? Do you mean something
else?
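
If the issue is simply heap, the usual knobs are the submit-time memory
settings rather than any GC flag; the OOM in the trace below happens on the
driver (DriverWrapper), so --driver-memory is the relevant one. A sketch with
illustrative values only:

spark-submit \
  --driver-memory 2g \
  --executor-memory 4g \
  --class com.crowdstar.etl.ParseAndClean \
  ...
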
On Jan 30, 2015 1:16 AM, "ey-chih chow"  wrote:

> I use the default value, which I think is 512MB.  If I change to 1024MB,
> spark-submit will fail due to not having enough memory for the RDD.
>
> Ey-Chih Chow
>
> --
> From: moham...@glassbeam.com
> To: eyc...@hotmail.com; user@spark.apache.org
> Subject: RE: unknown issue in submitting a spark job
> Date: Fri, 30 Jan 2015 00:32:57 +
>
>  How much memory are you assigning to the Spark executor on the worker
> node?
>
>
>
> Mohammed
>
>
>
> *From:* ey-chih chow [mailto:eyc...@hotmail.com]
> *Sent:* Thursday, January 29, 2015 3:35 PM
> *To:* Mohammed Guller; user@spark.apache.org
> *Subject:* RE: unknown issue in submitting a spark job
>
>
>
> The worker node has 15GB of memory, a 1x32 GB SSD, and 2 cores.  The data file is
> from S3. If I don't set mapred.max.split.size, it is fine with only one
> partition.  Otherwise, it will generate an OOME.
>
>
>
> Ey-Chih Chow
>
>
>
> > From: moham...@glassbeam.com
>
> > To: eyc...@hotmail.com; user@spark.apache.org
> > Subject: RE: unknown issue in submitting a spark job
> > Date: Thu, 29 Jan 2015 21:16:13 +
> >
> > Looks like the application is using a lot more memory than available.
> Could be a bug somewhere in the code or just an underpowered machine. Hard to
> say without looking at the code.
> >
> > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> > Mohammed
> >
> >
> > -Original Message-
> > From: ey-chih chow [mailto:eyc...@hotmail.com ]
> > Sent: Thursday, January 29, 2015 1:06 AM
> > To: user@spark.apache.org
> > Subject: unknown issue in submitting a spark job
> >
> > Hi,
> >
> > I submitted a job using spark-submit and got the following exception.
> > Anybody knows how to fix this? Thanks.
> >
> > Ey-Chih Chow
> >
> > 
> >
> > 15/01/29 08:53:10 INFO storage.BlockManagerMasterActor: Registering
> block manager ip-10-10-8-191.us-west-2.compute.internal:47722 with 6.6 GB
> RAM Exception in thread "main" java.lang.reflect.InvocationTargetException
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at
> >
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
> > at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> > Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at
> >
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:265)
> > at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:94)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> > at
> >
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> > at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
> > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
> > at scala.Option.getOrElse(Option.scala:120)
> > at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
> > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1128)
> > at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:935)
> > at
> >
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:832)
> > at com.crowdstar.etl.ParseAndClean$.main(ParseAndClean.scala:109)
> > at com.crowdstar.etl.ParseAndClean.main(ParseAndClean.scala)
> > ... 6 more
> > 15/01/29 08:54:33 INFO storage.BlockManager: Removing RDD 1
> > 15/01/29 08:54:33 ERROR actor.ActorSystemImpl: exception on LARS’ timer
> thread
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at
> >
> akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:397)
> > at
> akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
> > at java.lang.Thread.run(Thread.java:745)
> > 15/01/29 08:54:33 ERROR actor.ActorSystemImpl: Uncaught fatal error from
> thread [sparkDriver-scheduler-1] shutting down ActorSystem [sparkDriver]
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> > at
> >
> akka.actor.LightArrayRevolverScheduler$$a

Re: spark with cdh 5.2.1

2015-01-30 Thread Sean Owen
There is no need for a 2.5 profile. The hadoop-2.4 profile is for
Hadoop 2.4 and beyond. You can set the particular version you want
with -Dhadoop.version=

You do not need to make any new profile to compile vs 2.5.0-cdh5.2.1.
Again, the hadoop-2.4 profile is what you need.
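
For example, a sketch of the build command (add -Pyarn or other profiles as
your deployment requires):

mvn -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1 -DskipTests clean package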

On Thu, Jan 29, 2015 at 11:33 PM, Mohit Jaggi  wrote:
> Hi All,
> I noticed in pom.xml that there is no entry for Hadoop 2.5. Has anyone tried 
> Spark with 2.5.0-cdh5.2.1? Will replicating the 2.4 entry be sufficient to 
> make this work?
>
> Mohit.
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Hi: hadoop 2.5 for spark

2015-01-30 Thread fightf...@163.com
Hi, Siddharth
You can rebuild Spark with Maven by specifying -Dhadoop.version=2.5.0.
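
For example (a sketch; in Spark 1.2 the hadoop-2.4 profile covers Hadoop 2.4
and later, and you may also want -Pyarn):

mvn -Phadoop-2.4 -Dhadoop.version=2.5.0 -DskipTests clean package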

Thanks,
Sun.



fightf...@163.com
 
From: Siddharth Ubale
Date: 2015-01-30 15:50
To: user@spark.apache.org
Subject: Hi: hadoop 2.5 for spark
Hi ,
 
I am a beginner with Apache Spark.
 
Can anyone let me know whether it is mandatory to build Spark against the Hadoop 
version I am using, or whether I can use a pre-built package with my existing 
HDFS root folder?
I am using Hadoop 2.5.0 and want to use Apache Spark 1.2.0 with it.
I can see a pre-built version for 2.4 and above in the downloads section of 
the Spark homepage -> downloads.
 
Siddharth Ubale,
Synchronized Communications 
#43, Velankani Tech Park, Block No. II, 
3rd Floor, Electronic City Phase I,
Bangalore – 560 100
Tel : +91 80 3202 4060
Web: www.syncoms.com
London|Bangalore|Orlando
 
we innovate, plan, execute, and transform the business
 


Re: Hi: hadoop 2.5 for spark

2015-01-30 Thread bit1...@163.com
You can use the pre-built version that is built against Hadoop 2.4.




From: Siddharth Ubale
Date: 2015-01-30 15:50
To: user@spark.apache.org
Subject: Hi: hadoop 2.5 for spark
Hi ,
 
I am a beginner with Apache Spark.
 
Can anyone let me know whether it is mandatory to build Spark against the Hadoop 
version I am using, or whether I can use a pre-built package with my existing 
HDFS root folder?
I am using Hadoop 2.5.0 and want to use Apache Spark 1.2.0 with it.
I can see a pre-built version for 2.4 and above in the downloads section of 
the Spark homepage -> downloads.
 
Siddharth Ubale,
Synchronized Communications 
#43, Velankani Tech Park, Block No. II, 
3rd Floor, Electronic City Phase I,
Bangalore – 560 100
Tel : +91 80 3202 4060
Web: www.syncoms.com
London|Bangalore|Orlando
 
we innovate, plan, execute, and transform the business
 