spark cluster performance decreases by adding more nodes

2017-05-17 Thread Junaid Nasir
I have a large data set of 1B records and want to run analytics using
Apache Spark because of the scaling it provides, but I am seeing an
anti-pattern here: the more nodes I add to the Spark cluster, the longer the
completion time. The data store is Cassandra, and queries are run from
Zeppelin. I have tried many different queries, but even a simple
`dataframe.count()` behaves like this.

Here is the Zeppelin notebook; the temp table has 18M records:

val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "temp", "keyspace" -> "mykeyspace"))
  .load().cache()

df.registerTempTable("table")

%sql

SELECT first(devid),date,count(1) FROM table group by date,rtu order by date


When tested against different numbers of Spark worker nodes, these were the
results:

Spark nodes    Time
4 nodes        22 min 58 sec
3 nodes        15 min 49 sec
2 nodes        12 min 51 sec
1 node         17 min 59 sec

Increasing the number of nodes decreases performance, which should not happen,
as it defeats the purpose of using Spark.

If you want me to run any query or need further info about the setup, please ask.
Any clues as to why this is happening would be very helpful; I have been stuck on
this for two days now. Thank you for your time.


***versions***

Zeppelin: 0.7.1
Spark: 2.1.0
Cassandra: 2.2.9
Connector: datastax:spark-cassandra-connector:2.0.1-s_2.11

*Spark cluster specs*

6 vCPUs, 32 GB memory = 1 node

*Cassandra + Zeppelin server specs*
8 vCPUs, 52 GB memory


Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread darren
Maybe your master or Zeppelin server is running out of memory, and the more data
it receives the more memory swapping it has to do. Something to check.





Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread Jörn Franke
The issue might be the group by, which under certain circumstances can cause a lot
of traffic to one node. That transfer is of course smaller the fewer nodes you
have.
Have you checked what the UI reports?
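
For illustration only, a minimal sketch (column names come from the query above;
the partition count is a placeholder) of reducing the shuffle pressure a group by
creates and of inspecting the plan for the shuffle step:

import org.apache.spark.sql.functions.{count, first, lit}

// Placeholder value: the Spark SQL default of 200 shuffle partitions is often far
// too many for a small cluster and amplifies shuffle overhead for one aggregation.
sqlContext.setConf("spark.sql.shuffle.partitions", "48")

// The aggregation from the %sql paragraph, expressed on the cached DataFrame.
val agg = df
  .groupBy("date", "rtu")
  .agg(first("devid"), count(lit(1)))
  .orderBy("date")

agg.explain()   // Exchange operators in the plan are the cross-node shuffle steps
agg.show()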


Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-17 Thread kant kodali
Thanks Mike & Ryan. Now I can finally see my 5KB messages. However, I am
running into the following error.

OpenJDK 64-Bit Server VM warning: INFO:
os::commit_memory(0x00073470, 530579456, 0) failed; error='Cannot
allocate memory' (errno=12)

# There is insufficient memory for the Java Runtime Environment to continue.

# Native memory allocation (mmap) failed to map 530579456 bytes for
committing reserved memory.
# An error report file with more information is saved as:


I am running the Spark driver program in client mode on a standalone cluster
using Spark 2.1.1. When something like this happens, which memory do I need to
increase, and how? Should I increase the driver JVM memory or the executor JVM
memory?
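
As a general illustration (not a diagnosis of the error above): the two heaps are
sized independently at submit time. A sketch with placeholder master host, class
and jar names:

# Heap sizes are placeholders; adjust them to your hardware.
# --driver-memory sizes the driver JVM (in client mode, the submitting process);
# --executor-memory sizes each executor JVM on the workers.
./bin/spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode client \
  --driver-memory 4g \
  --executor-memory 8g \
  --class com.example.MyApp \
  my-app.jar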

On Tue, May 16, 2017 at 4:34 PM, Michael Armbrust 
wrote:

> I mean the actual kafka client:
>
> 
>   org.apache.kafka
>   kafka-clients
>   0.10.0.1
> 
>
>
> On Tue, May 16, 2017 at 4:29 PM, kant kodali  wrote:
>
>> Hi Michael,
>>
>> Thanks for the catch. I assume you meant
>> *spark-streaming-kafka-0-10_2.11-2.1.0.jar*
>>
>> I add this in all spark machines under SPARK_HOME/jars.
>>
>> Still same error seems to persist. Is that the right jar or is there
>> anything else I need to add?
>>
>> Thanks!
>>
>>
>>
>> On Tue, May 16, 2017 at 1:40 PM, Michael Armbrust > > wrote:
>>
>>> Looks like you are missing the kafka dependency.
>>>
>>> On Tue, May 16, 2017 at 1:04 PM, kant kodali  wrote:
>>>
 Looks like I am getting the following runtime exception. I am using
 Spark 2.1.0 and the following jars

 *spark-sql_2.11-2.1.0.jar*

 *spark-sql-kafka-0-10_2.11-2.1.0.jar*

 *spark-streaming_2.11-2.1.0.jar*


 Exception in thread "stream execution thread for [id = 
 fcfe1fa6-dab3-4769-9e15-e074af622cc1, runId = 
 7c54940a-e453-41de-b256-049b539b59b1]"

 java.lang.NoClassDefFoundError: 
 org/apache/kafka/common/serialization/ByteArrayDeserializer
 at 
 org.apache.spark.sql.kafka010.KafkaSourceProvider.createSource(KafkaSourceProvider.scala:74)
 at 
 org.apache.spark.sql.execution.datasources.DataSource.createSource(DataSource.scala:245)
 at 
 org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$logicalPlan$1.applyOrElse(StreamExecution.scala:127)
 at 
 org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$logicalPlan$1.applyOrElse(StreamExecution.scala:123)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:288)
 at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)


 On Tue, May 16, 2017 at 10:30 AM, Shixiong(Ryan) Zhu <
 shixi...@databricks.com> wrote:

> The default "startingOffsets" is "latest". If you don't push any data
> after starting the query, it won't fetch anything. You can set it to
> "earliest" like ".option("startingOffsets", "earliest")" to start the
> stream from the beginning.
>
> On Tue, May 16, 2017 at 12:36 AM, kant kodali 
> wrote:
>
>> Hi All,
>>
>> I have the following code.
>>
>>  val ds = sparkSession.readStream()
>> .format("kafka")
>> .option("kafka.bootstrap.servers",bootstrapServers))
>> .option("subscribe", topicName)
>> .option("checkpointLocation", hdfsCheckPointDir)
>> .load();
>>
>>  val ds1 = ds.select($"value")
>>  val query = 
>> ds1.writeStream.outputMode("append").format("console").start()
>>  query.awaitTermination()
>>
>> There are no errors when I execute this code however I don't see any
>> data being printed out to console? When I run my standalone test Kafka
>> consumer jar I can see that it is receiving messages. so I am not sure 
>> what
>> is going on with above code? any ideas?
>>
>> Thanks!
>>
>
>

>>>
>>
>


Re: Cloudera 5.8.0 and spark 2.1.1

2017-05-17 Thread Arkadiusz Bicz
It is working fine,  but it is not supported by Cloudera.



On May 17, 2017 1:30 PM, "issues solution" 
wrote:

> Hi,
> Is it possible to use a prebuilt version of Spark 2.1 inside Cloudera 5.8,
> which ships Scala 2.10 (not 2.11) and Java 1.7 (not 1.8)?
>
> Why?
>
> I am in a corporate environment and want to test the latest version of Spark,
> but my problem is that I don't know whether the prebuilt Spark 2.1.1 can work
> with this version of Cloudera. I mean the prebuilt version, not building from
> source; I don't have admin rights.
>
> The current versions are Spark 1.6, Scala 2.10, Hadoop 2.6.0-cdh5.8.0 and
> Hive 1.1.0-cdh5.8.0.
>
> Thanks a lot
>
>
>


Spark Launch programatically - Basics!

2017-05-17 Thread Nipun Arora
Hi,

I am trying to get a simple Spark application to run programmatically. I
looked at
http://spark.apache.org/docs/2.1.0/api/java/index.html?org/apache/spark/launcher/package-summary.html,
and at the following code.

public class MyLauncher {
  public static void main(String[] args) throws Exception {
    SparkAppHandle handle = new SparkLauncher()
      .setAppResource("/my/app.jar")
      .setMainClass("my.spark.app.Main")
      .setMaster("local")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication();
    // Use handle API to monitor / control application.
  }
}


I don't get any errors running this for my application, but I am running Spark
in local mode and the launcher class exits immediately after executing this
function. Are we supposed to wait for the process state, etc.?

Is there a more detailed example of how to monitor the application (input
streams, etc.)? Any GitHub link or blog post would help.

Thanks
Nipun
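
For what it's worth, a minimal sketch (not from this thread; paths and class names
are taken from the snippet above, the listener wiring is illustrative) of blocking
until the launched application reaches a final state:

import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object MyLauncher {
  def main(args: Array[String]): Unit = {
    val finished = new CountDownLatch(1)

    val handle = new SparkLauncher()
      .setAppResource("/my/app.jar")
      .setMainClass("my.spark.app.Main")
      .setMaster("local")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication(new SparkAppHandle.Listener {
        // FINISHED, FAILED and KILLED are final states.
        override def stateChanged(h: SparkAppHandle): Unit =
          if (h.getState.isFinal) finished.countDown()
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })

    finished.await()   // keep the launcher JVM alive until the application ends
    println(s"Application ${handle.getAppId} ended in state ${handle.getState}")
  }
}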


Re: spark cluster performance decreases by adding more nodes

2017-05-17 Thread ayan guha
How many nodes do you have in the Cassandra cluster?

Best Regards,
Ayan Guha


How to flatten struct into a dataframe?

2017-05-17 Thread kant kodali
Hi,

I have the following schema, and I am trying to put the structure below into a
data frame or dataset such that each field inside the struct becomes a column
in the data frame.
I tried to follow this link and did the following.

Dataset df = ds.select(functions.from_json(new Column("value").cast("string"),
    getSchema()).as("payload"));

Dataset df1 = df.select(df.col("payload.info"));
df1.printSchema();


root
 |-- info: struct (nullable = true)
 |    |-- index: string (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- number: integer (nullable = true)


However, I get the following:

++
|info|
++
|[,mango,,fruit...|
|[,apple,,fruit...|

I just want the data frame in the format below. Any ideas?

index | type | id | name | number

Thanks!
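
A minimal sketch (assuming the schema printed above; not an answer taken from this
thread) of flattening the struct by selecting its children with a wildcard:

// Expand every field of the `info` struct into its own top-level column.
val flat = df1.select("info.*")

flat.printSchema()
// root
//  |-- index: string (nullable = true)
//  |-- type: string (nullable = true)
//  ... and so on for id, name and number

flat.show(false)   // one column per struct field: index | type | id | name | number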


spark ML Recommender program

2017-05-17 Thread Arun
hi

I am writing a Spark ML movie recommender program in IntelliJ on Windows 10.
The dataset is 2MB with 10 data points, and my laptop has 8GB of memory.

When I set the number of iterations to 10, it works fine.
When I set the number of iterations to 20, I get a StackOverflow error.
What's the solution?

thanks





Sent from Samsung tablet


Re: Spark <--> S3 flakiness

2017-05-17 Thread lucas.g...@gmail.com
Steve, just to clarify:

"FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is
way better on high-performance reads, especially if you are working with
column data and can set the fs.s3a.experimental.fadvise=random option. "

Are you talking about the hadoop-aws lib or Hadoop itself? I see that Spark is
currently only pre-built against Hadoop 2.7.

Most of our failures are on write; the other fix I've seen advertised is
"fileoutputcommitter.algorithm.version=2".

Still doing some reading and will start testing in the next day or so.

Thanks!

Gary

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Kun Liu
What is the problem here?

There are a few setup steps needed to use Toree.

Best,

On Wed, May 17, 2017 at 7:22 PM, upendra 1991  wrote:

> What's the best way to use jupyter with Scala spark. I tried Apache toree
> and created a kernel but did not get it working. I believe there is a
> better way.
>
> Please suggest any best practices.
>
> Sent from Yahoo Mail on Android
> 
>


Re: scalastyle violation on mvn install but not on mvn package

2017-05-17 Thread Marcelo Vanzin
scalastyle runs on the "verify" phase, which is after package but
before install.
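
To illustrate the phase ordering (the skip property below is assumed from the
scalastyle-maven-plugin's documented configuration; verify it against the plugin
version in the pom):

# package stops before the verify phase, so the style check never runs:
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

# verify, and anything after it such as install, runs scalastyle and fails on violations:
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean verify

# possible workaround while the violations are being fixed: skip the check for one run
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests -Dscalastyle.skip=true clean install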

On Wed, May 17, 2017 at 5:47 PM, yiskylee  wrote:
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> package
> works, but
> ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
> install
> triggers scalastyle violation error.
>
> Is the scalastyle check not used on package but only on install? To install,
> should I turn off "failOnViolation" in the pom?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28693.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to see the full contents of a dataset or dataframe in structured streaming?

2017-05-17 Thread kant kodali
Hi All,

How do I see the full contents of a dataset or dataframe in structured
streaming, just like we normally do with *df.show(false)*? Is there any
parameter I can pass in to the code below?

val df1 = df.selectExpr("payload.data.*");

df1.writeStream().outputMode("append").format("console").start()


Thanks!
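
For what it's worth, a minimal sketch (not an answer from this thread) of the
options the console sink accepts for controlling output; values are illustrative:

val query = df1.writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", "false")   // the streaming equivalent of df.show(false)
  .option("numRows", "50")       // rows printed per batch (the default is 20)
  .start()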


Re: Jupyter spark Scala notebooks

2017-05-17 Thread Mark Vervuurt
Hi Upendra, 

I got toree to work and I described it in the following JIRA issue. 
See the last comment of the issue.
https://issues.apache.org/jira/browse/TOREE-336 
 

Mark

> On 18 May 2017, at 04:22, upendra 1991  wrote:
> 
> What's the best way to use jupyter with Scala spark. I tried Apache toree and 
> created a kernel but did not get it working. I believe there is a better way.
> 
> Please suggest any best practices.
> 
> Sent from Yahoo Mail on Android 
> 



Jupyter spark Scala notebooks

2017-05-17 Thread upendra 1991
What's the best way to use jupyter with Scala spark. I tried Apache toree and 
created a kernel but did not get it working. I believe there is a better way.
Please suggest any best practices.

Sent from Yahoo Mail on Android

Re: Jupyter spark Scala notebooks

2017-05-17 Thread Richard Moorhead
Take a look at Apache Zeppelin; it has both python and scala interpreters.
https://zeppelin.apache.org/

Apache Zeppelin
zeppelin.apache.org
Apache Zeppelin. A web-based notebook that enables interactive data analytics. 
You can make beautiful data-driven, interactive and collaborative documents 
with SQL ...






. . . . . . . . . . . . . . . . . . . . . . . . . . .

Richard Moorhead
Software Engineer
richard.moorh...@c2fo.com

C2FO: The World's Market for Working Capital®



The information contained in this message and any attachment may be privileged, 
confidential, and protected from disclosure. If you are not the intended 
recipient, or an employee, or agent responsible for delivering this message to 
the intended recipient, you are hereby notified that any dissemination, 
distribution, or copying of this communication is strictly prohibited. If you 
have received this communication in error, please notify us immediately by 
replying to the message and deleting from your computer.



From: upendra 1991 
Sent: Wednesday, May 17, 2017 9:22:14 PM
To: user@spark.apache.org
Subject: Jupyter spark Scala notebooks

What's the best way to use jupyter with Scala spark. I tried Apache toree and 
created a kernel but did not get it working. I believe there is a better way.

Please suggest any best practices.

Sent from Yahoo Mail on 
Android


unsubscribe

2017-05-17 Thread suyash kharade



Re: spark ML Recommender program

2017-05-17 Thread Kun Liu
Hi Arun,

Would you like to show us the code?



Re: spark ML Recommender program

2017-05-17 Thread Mark Vervuurt
If you are running locally, try increasing driver memory to, for example, 4G
and executor memory to 3G.
Regards, Mark
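
A small sketch (assuming Spark 2.x; names are illustrative) of where those memory
settings apply when running locally from an IDE:

import org.apache.spark.sql.SparkSession

// In local mode everything runs inside one JVM, so the heap that matters is the
// driver JVM's. Setting spark.driver.memory from code has no effect once that JVM
// is already running; raise the IDE's VM options (e.g. -Xmx4g) or pass
// --driver-memory 4g when launching through spark-submit.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("movie-recommender")
  .config("spark.executor.memory", "3g")   // only relevant when running on a cluster
  .getOrCreate()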








scalastyle violation on mvn install but not on mvn package

2017-05-17 Thread yiskylee
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
package 
works, but 
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean
install 
triggers scalastyle violation error. 

Is the scalastyle check not used on package but only on install? To install,
should I turn off "failOnViolation" in the pom? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/scalastyle-violation-on-mvn-install-but-not-on-mvn-package-tp28693.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with Toree works well for my team. Jupyter is much more refined than
Zeppelin as far as notebook features and usability: shortcuts, editing, etc.
The caveat is that it is better to run a separate server instance for
Python/PySpark vs. Scala/Spark.

2017-05-17 19:27 GMT-07:00 Richard Moorhead :

> Take a look at Apache Zeppelin; it has both python and scala interpreters.
> https://zeppelin.apache.org/
> Apache Zeppelin 
> zeppelin.apache.org
> Apache Zeppelin. A web-based notebook that enables interactive data
> analytics. You can make beautiful data-driven, interactive and
> collaborative documents with SQL ...
>
>
>
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . .
>
> Richard Moorhead
> Software Engineer
> richard.moorh...@c2fo.com 
>
> C2FO: The World's Market for Working Capital®
>
> --
> *From:* upendra 1991 
> *Sent:* Wednesday, May 17, 2017 9:22:14 PM
> *To:* user@spark.apache.org
> *Subject:* Jupyter spark Scala notebooks
>
> What's the best way to use jupyter with Scala spark. I tried Apache toree
> and created a kernel but did not get it working. I believe there is a
> better way.
>
> Please suggest any best practices.
>
> Sent from Yahoo Mail on Android
> 
>


Re: Jupyter spark Scala notebooks

2017-05-17 Thread kanth909
Which of these notebooks can help populate real-time graphs through a web
socket or some other push mechanism?

Sent from my iPhone

> On May 17, 2017, at 8:50 PM, Stephen Boesch  wrote:
> 
> Jupyter with toree works well for my team.  Jupyter is well more refined vs 
> zeppelin as far as notebook features and usability: shortcuts, editing,etc.   
> The caveat is it is better to run a separate server instanace for 
> python/pyspark vs scala/spark
> 
> 2017-05-17 19:27 GMT-07:00 Richard Moorhead :
>> Take a look at Apache Zeppelin; it has both python and scala interpreters.
>> https://zeppelin.apache.org/
>> 
>> Apache Zeppelin
>> zeppelin.apache.org
>> Apache Zeppelin. A web-based notebook that enables interactive data 
>> analytics. You can make beautiful data-driven, interactive and collaborative 
>> documents with SQL ...
>> 
>> 
>> 
>> 
>> . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> 
>> Richard Moorhead
>> Software Engineer
>> richard.moorh...@c2fo.com
>> C2FO: The World's Market for Working Capital®
>> 
>> 
>> From: upendra 1991 
>> Sent: Wednesday, May 17, 2017 9:22:14 PM
>> To: user@spark.apache.org
>> Subject: Jupyter spark Scala notebooks
>>  
>> What's the best way to use jupyter with Scala spark. I tried Apache toree 
>> and created a kernel but did not get it working. I believe there is a better 
>> way.
>> 
>> Please suggest any best practices.
>> 
>> Sent from Yahoo Mail on Android
> 


Re: checkpointing without streaming?

2017-05-17 Thread Tathagata Das
Why not just save the RDD to a proper file? Text file, sequence file, many
options. Then it's standard to read it back in a different program.
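
A minimal sketch (paths are placeholders) of persisting the RDD in one application
and reading it back in another run:

// Run 1: compute and persist.
val x = List(1, 2, 3, 4)
val y = sc.parallelize(x, 2).map(c => c * 2)
y.saveAsObjectFile("hdfs:///tmp/doubled")        // or saveAsTextFile / a sequence file

// Run 2 (a separate application): load what was saved.
val restored = sc.objectFile[Int]("hdfs:///tmp/doubled")
restored.count()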

On Wed, May 17, 2017 at 12:01 AM, neelesh.sa 
wrote:

> Is it possible to checkpoint a RDD in one run of my application and use the
> saved RDD in the next run of my application?
>
> For example, with the following code:
> val x = List(1,2,3,4)
> val y = sc.parallelize(x ,2).map( c => c*2)
> y.checkpoint
> y.count
>
> Is it possible to read the checkpointed RDD in another application?
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/checkpointing-without-streaming-tp4541p28691.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


RE: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-17 Thread Mendelson, Assaf
Thanks for the response.
I will try with log4j.
That said, I am running on Windows using winutils.exe and still getting the
warning.

Thanks,
  Assaf.

From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Tuesday, May 16, 2017 6:55 PM
To: Mendelson, Assaf
Cc: user@spark.apache.org
Subject: Re: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable


On 10 May 2017, at 13:40, Mendelson, Assaf 
> wrote:

Hi all,
When running spark I get the following warning: [WARN] 
org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Now I know that in general it is possible to ignore this warning, however, it 
means that utilities that catch “WARN” in the log keep flagging this.
I saw many answers to handling this (e.g. 
http://stackoverflow.com/questions/30369380/hadoop-unable-to-load-native-hadoop-library-for-your-platform-error-on-docker,
 
http://stackoverflow.com/questions/19943766/hadoop-unable-to-load-native-hadoop-library-for-your-platform-warning,http://stackoverflow.com/questions/40015416/spark-unable-to-load-native-hadoop-library-for-your-platform),
 however, I am unable to solve this on my local machine.
Specifically, I can’t find any such solution for windows (i.e. when running 
developer local builds) or on a centos 7 machine with no HDFS (basically it is 
a single node machine which uses spark standalone for testing).


Log4J is your friend. I usually have (at least)

log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN

if you are working on Windows though, you do actually need the native libraries
and winutils.exe on your path, or things won't work


Any help would be appreciated.

Thanks,
  Assaf.



Re: checkpointing without streaming?

2017-05-17 Thread neelesh.sa
Is it possible to checkpoint a RDD in one run of my application and use the
saved RDD in the next run of my application?

For example, with the following code:
val x = List(1,2,3,4)
val y = sc.parallelize(x ,2).map( c => c*2)
y.checkpoint
y.count

Is it possible to read the checkpointed RDD in another application?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/checkpointing-without-streaming-tp4541p28691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Parquet file amazon s3a timeout

2017-05-17 Thread Karin Valisova
Hello!
I'm working with some Parquet files saved on Amazon S3 and loading them into a
dataframe with

Dataset df = spark.read() .parquet(parketFileLocation);

However, after some time I get the "Timeout waiting for connection from pool"
exception. I hope I'm not mistaken, but I think there is a limit on how long
any open s3a connection can be held; I have enough local memory to simply load
the file and close the connection.

Is it possible to specify some option when reading the Parquet file so that the
data is stored locally and the connection released? Or any other ideas on how
to solve the problem?

Thank you very much,
have a nice day!
Karin


Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran

On 17 May 2017, at 06:00, lucas.g...@gmail.com 
wrote:

Steve, thanks for the reply.  Digging through all the documentation now.

Much appreciated!



FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way 
better on high-performance reads, especially if you are working with column 
data and can set the fs.s3a.experimental.fadvise=random option.

That's in apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest versions 
of CDH, even if their docs don't mention it

https://hortonworks.github.io/hdp-aws/s3-performance/
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html
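
For reference, a minimal sketch (assuming the Hadoop 2.8 S3A client is on the
classpath, as described above) of passing that option through Spark; the
spark.hadoop. prefix forwards it into the Hadoop configuration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-random-reads")
  // forwarded to the Hadoop configuration as fs.s3a.experimental.fadvise
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// Equivalent at runtime on an existing session:
spark.sparkContext.hadoopConfiguration.set("fs.s3a.experimental.fadvise", "random")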


On 16 May 2017 at 10:10, Steve Loughran 
> wrote:

On 11 May 2017, at 06:07, lucas.g...@gmail.com 
wrote:

Hi users, we have a bunch of pyspark jobs that are using S3 for loading / 
intermediate steps and final output of parquet files.

Please don't, not without a committer specially written to work against S3 in
the presence of failures. You are at risk of things going wrong and not even
noticing.

The only one that I trust to do this right now is:
https://github.com/rdblue/s3committer


see also : https://github.com/apache/spark/blob/master/docs/cloud-integration.md



We're running into the following issues on a semi regular basis:
* These are intermittent errors, IE we have about 300 jobs that run nightly... 
And a fairly random but small-ish percentage of them fail with the following 
classes of errors.

S3 write errors

"ERROR Utils: Aborting task
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS 
Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error 
Message: Not Found, S3 Extended Request ID: BlaBlahEtc="

"Py4JJavaError: An error occurred while calling o43.parquet.
: com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, 
AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error 
Message: One or more objects could not be deleted, S3 Extended Request ID: null"


S3 Read Errors:

[Stage 1:=>   (27 + 4) / 
31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 
11)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at 
org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at 
org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at 
org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
at 
org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at 
org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)


We have literally tons of logs we can add but it would make the email unwieldy 
big.  If it would be helpful I'll drop them in a pastebin or something.

Our config is along the lines of:

  *   spark-2.1.0-bin-hadoop2.7
  *   '--packages 
com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 
pyspark-shell'

You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready to 
play with. In particular, in a close() call it reads to the end of the stream, 
which is a performance killer on large files. That stack trace you see is from 
that same phase of operation, so should go away too.

Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one will 
probably cause link errors.
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3

Also: make sure Joda time >= 2.8.1 for Java 8

If you go up to 2.8.0, and you still see the errors, file something against 
HADOOP in JIRA


Given the stack overflow / googling 

Re: Parquet file amazon s3a timeout

2017-05-17 Thread Steve Loughran

On 17 May 2017, at 11:13, Karin Valisova 
> wrote:

Hello!
I'm working with some parquet files saved on amazon service and loading them to 
dataframe with

Dataset df = spark.read() .parquet(parketFileLocation);

however, after some time I get the "Timeout waiting for connection from pool" 
exception. I hope I'm not mistaken, but I think that there's the limitation for 
the length of any open connection with s3a, but I have enough local memory to 
actually just load the file and close the connection.

1. Which version of the Hadoop binaries? You should be using Hadoop 2.7.x for S3A
to start working properly (see https://issues.apache.org/jira/browse/HADOOP-11571
for the list of issues).

2. If you move up to 2.7 and still see the exception, can you paste the full
stack trace?


Is it possible to specify some option when reading the parquet to store the 
data locally and release the connection? Or any other ideas on how to solve the 
problem?


If the problem is still there with the Hadoop 2.7 binaries, then there are some
thread pool options related to the AWS transfer manager and some other pooling
going on, as well as the setting fs.s3a.connection.maximum to play with.


http://hadoop.apache.org/docs/r2.7.3/hadoop-aws/tools/hadoop-aws/index.html


though as usual, people are always finding new corner cases to deadlock. Here I 
suspect https://issues.apache.org/jira/browse/HADOOP-13826; which is fixed in 
Hadoop 2.8+

-Steve


Cloudera 5.8.0 and spark 2.1.1

2017-05-17 Thread issues solution
Hi,
Is it possible to use a prebuilt version of Spark 2.1 inside Cloudera 5.8,
which ships Scala 2.10 (not 2.11) and Java 1.7 (not 1.8)?

Why?

I am in a corporate environment and want to test the latest version of Spark,
but my problem is that I don't know whether the prebuilt Spark 2.1.1 can work
with this version of Cloudera. I mean the prebuilt version, not building from
source; I don't have admin rights.

The current versions are Spark 1.6, Scala 2.10, Hadoop 2.6.0-cdh5.8.0 and
Hive 1.1.0-cdh5.8.0.

Thanks a lot


Re: s3 bucket access/read file

2017-05-17 Thread Steve Loughran

On 17 May 2017, at 00:10, jazzed 
> wrote:

How did you solve the problem with V4?


Which V4 problem? Authentication?

You need to declare the explicit S3A endpoint via fs.s3a.endpoint; otherwise
you get a generic "bad auth" message, which is not a good place to start
debugging from.

full list here: https://hortonworks.github.io/hdp-aws/s3-configure/index.html





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/s3-bucket-access-read-file-tp23536p28688.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org