Re: Is spark suitable for real time query

2015-07-28 Thread Petar Zecevic


You can try out a few tricks employed by folks at Lynx Analytics... 
Daniel Darabos gave some details at Spark Summit:

https://www.youtube.com/watch?v=zt1LdVj76LU&index=13&list=PL-x35fyliRwhP52fwDqULJLOnqnrN5nDs


On 22.7.2015. 17:00, Louis Hust wrote:

My code is like below:
Map<String, String> t11opt = new HashMap<String, String>();
t11opt.put("url", DB_URL);
t11opt.put("dbtable", "t11");
DataFrame t11 = sqlContext.load("jdbc", t11opt);
t11.registerTempTable("t11");

...the same for t12, t21, t22


DataFrame t1 = t11.unionAll(t12);
t1.registerTempTable("t1");
DataFrame t2 = t21.unionAll(t22);
t2.registerTempTable("t2");
for (int i = 0; i < 10; i++) {
    System.out.println(new Date(System.currentTimeMillis()));
    DataFrame crossjoin = sqlContext.sql(
        "select txt from t1 join t2 on t1.id = t2.id");
    crossjoin.show();
    System.out.println(new Date(System.currentTimeMillis()));
}

Where t11, t12, t21, t22 are all DataFrames loaded via JDBC from a MySQL
database running on the same host as the Spark job.


But each loop iteration takes about 3 seconds. I do not know why it costs
so much time.





2015-07-22 19:52 GMT+08:00 Robin East robin.e...@xense.co.uk 
mailto:robin.e...@xense.co.uk:


Here’s an example using spark-shell on my laptop:

sc.textFile("LICENSE").filter(_ contains "Spark").count

This takes less than a second the first time I run it and is
instantaneous on every subsequent run.

What code are you running?



On 22 Jul 2015, at 12:34, Louis Hust louis.h...@gmail.com
mailto:louis.h...@gmail.com wrote:

I did a simple test using Spark in standalone mode (not cluster)
 and found that a simple action takes a few seconds, even though the data
size is small, just a few rows.
So will each Spark job cost some time for init or prepare work, no
matter what the job is?
I mean, will the basic framework of a Spark job always cost seconds?

2015-07-22 19:17 GMT+08:00 Robin East robin.e...@xense.co.uk
mailto:robin.e...@xense.co.uk:

Real-time is, of course, relative but you’ve mentioned
microsecond level. Spark is designed to process large amounts
of data in a distributed fashion. No distributed system I
know of could give any kind of guarantees at the microsecond
level.

Robin

 On 22 Jul 2015, at 11:14, Louis Hust louis.h...@gmail.com
mailto:louis.h...@gmail.com wrote:

 Hi, all

 I am using a Spark jar in standalone mode, fetching data from
different MySQL instances and doing some actions, but I found that the
time taken is at the second level.

 So I want to know if a Spark job is suitable for real-time
queries which need microsecond latency?









Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Petar Zecevic


Sorry about the self-promotion, but there's a really nice tutorial for 
setting up Eclipse for Spark in the Spark in Action book:

http://www.manning.com/bonaci/


On 27.7.2015. 10:22, Akhil Das wrote:
You can follow this doc 
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup


Thanks
Best Regards

On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com 
mailto:ksiv...@gmail.com wrote:


Hi All,

I am trying to set up Eclipse (LUNA) with Maven so that I can create
Maven projects for developing Spark programs. I am having some issues
and I am not sure what the issue is.


  Can anyone share a nice step-by-step document on configuring Eclipse
with Maven for Spark development?


Thanks
Siva



--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
mailto:user-h...@spark.apache.org






Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Petar Zecevic


Sorry about the self-promotion, but there's a really nice tutorial for 
setting up Eclipse for Spark in the Spark in Action book:

http://www.manning.com/bonaci/


On 24.7.2015. 7:26, Siva Reddy wrote:

Hi All,

 I am trying to set up Eclipse (LUNA) with Maven so that I can create
Maven projects for developing Spark programs. I am having some issues and I
am not sure what the issue is.


   Can anyone share a nice step-by-step document on configuring Eclipse with
Maven for Spark development?


Thanks
Siva



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Eclipse-IDE-Maven-tp23977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Fwd: Model weights of linear regression becomes abnormal values

2015-05-29 Thread Petar Zecevic


You probably need to scale the values in the data set so that they are
all in comparable ranges, and translate them so that their means become 0.


You can use the pyspark.mllib.feature.StandardScaler(True, True) object
for that.
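
For what it's worth, here is a minimal sketch of the same idea in the Scala
API (the original question below is on PySpark, where the
pyspark.mllib.feature.StandardScaler(True, True) object mentioned above
plays the same role; the tiny data set here is made up):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Toy data with features on very different scales.
val data = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(100000.0, 0.001)),
  LabeledPoint(0.0, Vectors.dense(250000.0, 0.004))))

// Fit the scaler on the feature vectors only, then rescale every point
// (mean 0, unit standard deviation) before training the regression.
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(data.map(_.features))
val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))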


On 28.5.2015. 6:08, Maheshakya Wijewardena wrote:


Hi,

I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with 
the attached dataset. The code is attached. When I check the model 
weights vector after training, it contains `nan` values.

[nan,nan,nan,nan,nan,nan,nan,nan]
But for some data sets, this problem does not occur. What might be the reason 
for this?
Is this an issue with the data I'm using or a bug?
Best regards.
--
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 Lanka (Pvt) Ltd
Email: mahesha...@wso2.com mailto:mahesha...@wso2.com
Mobile: +94711228855




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: How to configure SparkUI to use internal ec2 ip

2015-03-31 Thread Petar Zecevic


Did you try setting the SPARK_MASTER_IP parameter in spark-env.sh?
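
For reference, that is just an export in conf/spark-env.sh on the master
(the address below is only a placeholder for your internal EC2 IP):

export SPARK_MASTER_IP=10.0.0.12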


On 31.3.2015. 19:19, Anny Chen wrote:

Hi Akhil,

I tried editing /etc/hosts on the master and on the workers, but it 
seems it is not working for me.


I tried adding "hostname internal-ip" and it didn't work. I then 
tried adding "internal-ip hostname" and it didn't work either. I 
guess I should also edit the spark-env.sh file?


Thanks!
Anny

On Mon, Mar 30, 2015 at 11:15 PM, Akhil Das 
ak...@sigmoidanalytics.com mailto:ak...@sigmoidanalytics.com wrote:


You can add an internal ip to public hostname mapping in your
/etc/hosts file, if your forwarding is proper then it wouldn't be
a problem there after.



Thanks
Best Regards

On Tue, Mar 31, 2015 at 9:18 AM, anny9699 anny9...@gmail.com
mailto:anny9...@gmail.com wrote:

Hi,

For security reasons, we added a server between my AWS Spark
cluster and
local, so I couldn't connect to the cluster directly. To see
the SparkUI and
the related workers' stdout and stderr, I used dynamic
forwarding and
configured the SOCKS proxy. Now I can see the SparkUI using
the internal
ec2 ip, however when I click on the application UI (4040) or
the worker's UI
(8081), it still automatically uses the public DNS instead of
the internal ec2
ip, which the browser now cannot resolve.

Is there a way that I could configure this? I saw that one
could configure
the LOCAL_ADDRESS_IP in the spark-env.sh, but not sure whether
this could
help. Does anyone experience the same issue?

Thanks a lot!
Anny




--
View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/How-to-configure-SparkUI-to-use-internal-ec2-ip-tp22311.html
Sent from the Apache Spark User List mailing list archive at
Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
mailto:user-h...@spark.apache.org







Re: Spark-submit and multiple files

2015-03-20 Thread Petar Zecevic


I tried your program in yarn-client mode and it worked with no 
exception. This is the command I used:


spark-submit --master yarn-client --py-files work.py main.py

(Spark 1.2.1)

On 20.3.2015. 9:47, Guillaume Charhon wrote:

Hi Davies,

I am already using --py-files. The system does use the other file. The 
error I am getting is not trivial. Please check the error log.




On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com 
mailto:dav...@databricks.com wrote:


You could submit additional Python source files via --py-files, for
example:

$ bin/spark-submit --py-files work.py main.py

On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez
guilla...@databerries.com mailto:guilla...@databerries.com wrote:
 Hello guys,

 I am having a hard time understanding how spark-submit behaves
with multiple
 files. I have created two code snippets. Each code snippet is
composed of a
 main.py and a work.py. The code works if I paste work.py and then
main.py into a
 pyspark shell. However, both snippets fail when using
spark-submit and
 generate different errors.

 Function add_1 definition outside
 http://www.codeshare.io/4ao8B
 https://justpaste.it/jzvj

 Embedded add_1 function definition
 http://www.codeshare.io/OQJxq
 https://justpaste.it/jzvn

 I am trying to find a way to make it work.

 Thank you for your support.



 --
 View this message in context:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-and-multiple-files-tp22097.html
 Sent from the Apache Spark User List mailing list archive at
Nabble.com.


-
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
mailto:user-h...@spark.apache.org







Re: Hamburg Apache Spark Meetup

2015-02-25 Thread Petar Zecevic


Please add the Zagreb Meetup group, too.

http://www.meetup.com/Apache-Spark-Zagreb-Meetup/

Thanks!

On 18.2.2015. 19:46, Johan Beisser wrote:

If you could also add the Hamburg Apache Spark Meetup, I'd appreciate it.

http://www.meetup.com/Hamburg-Apache-Spark-Meetup/

On Tue, Feb 17, 2015 at 5:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

Thanks! I've added you.

Matei


On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu 
ra...@the4thfloor.eu wrote:

Hi,


there is a small Spark Meetup group in Berlin, Germany :-)
http://www.meetup.com/Berlin-Apache-Spark-Meetup/

Please add this group to the Meetups list at
https://spark.apache.org/community.html


Ralph

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-25 Thread Petar Zecevic


I believe your class needs to be defined as a case class (as I answered 
on SO).
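
For reference, a minimal sketch of the case-class route in the Spark
1.2-era shell (the class and its few fields below are made up, just to
show the mechanics):

case class Person(id: Long, name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion: RDD of a case class -> SchemaRDD

val people = sc.parallelize(Seq(Person(1L, "Ann", 34), Person(2L, "Bob", 19)))
people.registerTempTable("people")
sqlContext.sql("select name from people where age > 21").collect().foreach(println)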



On 25.2.2015. 5:15, anamika gupta wrote:

Hi Akhil

I guess it skipped my attention. I would definitely give it a try.

Meanwhile, I would still like to know what the issue is with the way I have 
created the schema.


On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das ak...@sigmoidanalytics.com 
mailto:ak...@sigmoidanalytics.com wrote:


Did you happen to have a look at

https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema

Thanks
Best Regards

On Tue, Feb 24, 2015 at 3:39 PM, anu anamika.guo...@gmail.com
mailto:anamika.guo...@gmail.com wrote:

My issue is posted here on stack-overflow. What am I doing
wrong here?


http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi


View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Facing-error-while-extending-scala-class-with-Product-interface-to-overcome-limit-of-22-fields-in-spl-tp21787.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Accumulator in SparkUI for streaming

2015-02-24 Thread Petar Zecevic


Interesting. Accumulators are shown on the Web UI if you are using the 
ordinary SparkContext (Spark 1.2). The accumulator just has to be named 
(and that's what you did).


scala> val acc = sc.accumulator(0, "test accumulator")
acc: org.apache.spark.Accumulator[Int] = 0

scala> val rdd = sc.parallelize(1 to 1000)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at
parallelize at <console>:12

scala> rdd.foreach(x => acc += 1)

scala> acc.value
res1: Int = 1000

The Stage details page then shows the named accumulator and its value (screenshot omitted).




On 20.2.2015. 9:25, Tim Smith wrote:

On Spark 1.2:

I am trying to capture # records read from a kafka topic:

val inRecords = ssc.sparkContext.accumulator(0, "InRecords")

..

kInStreams.foreach( k =>
{

 k.foreachRDD ( rdd => inRecords += rdd.count().toInt )
 inRecords.value
})


The question is how do I get the accumulator to show up in the UI? I tried 
"inRecords.value" but that didn't help. I'm pretty sure it isn't showing 
up in the Stage metrics.


What's the trick here? collect?

Thanks,

Tim





Re: Posting to the list

2015-02-21 Thread Petar Zecevic


The message went through after all. Sorry for spamming.


On 21.2.2015. 21:27, pzecevic wrote:

Hi Spark users.

Does anybody know what are the steps required to be able to post to this
list by sending an email to user@spark.apache.org? I just sent a reply to
Corey Nolet's mail "Missing shuffle files" but I don't think it was accepted
by the engine.

If I look at the Spark user list, I don't see this topic (Missing shuffle
files) at all: http://apache-spark-user-list.1001560.n3.nabble.com/

I can see it in the archives, though:
https://mail-archives.apache.org/mod_mbox/spark-user/201502.mbox/browser
but my answer is not there.

This is not the first time this happened and I am wondering what is going
on. The engine is eating my emails? It doesn't like me?
I am subscribed to the list and I have the Nabble account.
I previously saw one of my emails marked with "This message has not been
accepted by the mailing list yet". I read what that means, but I don't think
it applies to me.

What am I missing?

P.S.: I am posting this through the Nabble web interface. Hope it gets
through...




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Missing shuffle files

2015-02-21 Thread Petar Zecevic


Could you try to turn on the external shuffle service?

spark.shuffle.service.enabled = true
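
For example, a minimal sketch of switching it on when the SparkContext is
created (the setting can equally go into spark-defaults.conf; on YARN the
external shuffle service itself also has to be set up on the node managers):

import org.apache.spark.{SparkConf, SparkContext}

// Enable the external shuffle service for this application.
val conf = new SparkConf()
  .setAppName("shuffle-service-example")  // placeholder app name
  .set("spark.shuffle.service.enabled", "true")
val sc = new SparkContext(conf)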


On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing 
that executors are being lost as well. Thing is, I can't figure out 
how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 
1.3TB of memory allocated for the application. I was thinking perhaps 
it was possible that a single executor was getting a single or a 
couple large partitions but shouldn't the disk persistence kick in at 
that point?


On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg arp...@spotify.com 
mailto:arp...@spotify.com wrote:


For large jobs, the following error message is shown that seems to
indicate that shuffle files for some reason are missing. It's a
rather large job with many partitions. If the data size is
reduced, the problem disappears. I'm running a build from Spark
master post 1.2 (build at 2015-01-16) and running on Yarn 2.2. Any
idea of how to resolve this problem?

User class threw exception: Job aborted due to stage failure: Task
450 in stage 450.1 failed 4 times, most recent failure: Lost task
450.3 in stage 450.1 (TID 167370,
lon4-hadoopslave-b77.lon4.spotify.net):
java.io.FileNotFoundException:

/disk/hd06/yarn/local/usercache/arpteg/appcache/application_1424333823218_21217/spark-local-20150221154811-998c/03/rdd_675_450
(No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:76)
at
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:786)
at
org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:637)

at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:149)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:74)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:192)
at

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

TIA,
Anders






Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Petar Zecevic


You can enable YARN log aggregation (set yarn.log-aggregation-enable to
true) and execute the command

yarn logs -applicationId <your_application_id>
after your application finishes.

Or you can look at them directly in HDFS in
/tmp/logs/<user>/logs/<applicationid>/<hostname>


On 6.2.2015. 19:50, nitinkak001 wrote:

I am trying to debug my mapPartitions function. Here is the code. I am trying
to log in two ways, using log.info() and println(). I am running in
yarn-cluster mode. While I can see the logs from the driver code, I am not able
to see the logs from the map and mapPartitions functions in the Application Tracking
URL. Where can I find the logs?

 var outputRDD = partitionedRDD.mapPartitions(p => {
   val outputList = new ArrayList[scala.Tuple3[Long, Long, Int]]
   p.map({ case (key, value) => {
     log.info("Inside map")
     println("Inside map")
     for (i <- 0 until outputTuples.size()) {
       val outputRecord = outputTuples.get(i)
       if (outputRecord != null) {
         outputList.add((outputRecord.getCurrRecordProfileID(),
           outputRecord.getWindowRecordProfileID, outputRecord.getScore()))
       }
     }
   }
   })
   outputList.iterator()
 })

Here is my log4j.properties

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
%c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Where-can-I-find-logs-set-inside-RDD-processing-functions-tp21537.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LeaseExpiredException while writing schemardd to hdfs

2015-02-05 Thread Petar Zecevic


Why don't you just map the RDD's rows to lines and then call saveAsTextFile()?
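
Something along these lines, reusing the names from the snippet below
(rdd, splitter, outputpath); coalesce(1) is there only because a single
output file is wanted, and it assumes the data fits through one task:

// One text line per Row, columns joined with the splitter string.
val lines = rdd.map(row => row.mkString(splitter))

// coalesce(1) produces a single part file under outputpath.
lines.coalesce(1).saveAsTextFile(outputpath)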

On 3.2.2015. 11:15, Hafiz Mujadid wrote:

I want to write a whole SchemaRDD to a single file in HDFS but I am facing the
following exception

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
No lease on /test/data/data1.csv (inode 402042): File does not exist. Holder
DFSClient_NONMAPREDUCE_-564238432_57 does not have any open files

Here is my code:
rdd.foreachPartition( iterator => {
   var output = new Path( outputpath )
   val fs = FileSystem.get( new Configuration() )
   var writer : BufferedWriter = null
   writer = new BufferedWriter( new OutputStreamWriter( fs.create( output ) ) )
   var line = new StringBuilder
   iterator.foreach( row => {
     row.foreach( column => {
       line.append( column.toString + splitter )
     } )
     writer.write( line.toString.dropRight( 1 ) )
     writer.newLine()
     line.clear
   } )
   writer.close()
} )

I think the problem is that I am creating a writer for each partition and
multiple writers are executing in parallel, so when they try to write to the
same file this problem appears.
When I avoid this approach I face a "task not serializable" exception.

Any suggestion on how to handle this problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/LeaseExpiredException-while-writing-schemardd-to-hdfs-tp21477.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Petar Zecevic


Ok, thanks for the clarifications. I didn't know this list had to remain 
the only official list.


Nabble is really not the best solution in the world, but we're stuck 
with it, I guess.


That's it from me on this subject.

Petar


On 22.1.2015. 3:55, Nicholas Chammas wrote:


I think a few things need to be laid out clearly:

 1. This mailing list is the “official” user discussion platform. That
is, it is sponsored and managed by the ASF.
 2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in Stack
Overflow under the apache-spark and related tags. Stack Overflow
works quite well.
 3. The ASF will not agree to deprecating or migrating this user list
to a platform that they do not control.
 4. This mailing list has grown to an unwieldy size and discussions
are hard to find or follow; discussion tooling is also lacking. We
want to improve the utility and user experience of this mailing list.
 5. We don’t want to fragment this “official” discussion community.
 6. Nabble is an independent product not affiliated with the ASF. It
offers a slightly better interface to the Apache mailing list
archives.

So to respond to some of your points, pzecevic:

Apache user group could be frozen (not accepting new questions, if
that’s possible) and redirect users to Stack Overflow (automatic
reply?).

From what I understand of the ASF’s policies, this is not possible. :( 
This mailing list must remain the official Spark user discussion platform.


Other thing, about new Stack Exchange site I proposed earlier. If
a new site is created, there is no problem with guidelines, I
think, because Spark community can apply different guidelines for
the new site.

I think Stack Overflow and the various Spark tags are working fine. I 
don’t see a compelling need for a Stack Exchange dedicated to Spark, 
either now or in the near future. Also, I doubt a Spark-specific site 
can pass the 4 tests in the Area 51 FAQ 
http://area51.stackexchange.com/faq:


  * Almost all Spark questions are on-topic for Stack Overflow
  * Stack Overflow already exists, it already has a tag for Spark, and
nobody is complaining
  * You’re not creating such a big group that you don’t have enough
experts to answer all possible questions
  * There’s a high probability that users of Stack Overflow would
enjoy seeing the occasional question about Spark

I think complaining won’t be sufficient. :)

Someone expressed a concern that they won’t allow creating a
project-specific site, but there already exist some
project-specific sites, like Tor, Drupal, Ubuntu…

The communities for these projects are many, many times larger than 
the Spark community is or likely ever will be, simply due to the 
nature of the problems they are solving.


What we need is an improvement to this mailing list. We need better 
tooling than Nabble to sit on top of the Apache archives, and we also 
need some way to control the volume and quality of mail on the list so 
that it remains a useful resource for the majority of users.


Nick


On Wed Jan 21 2015 at 3:13:21 PM pzecevic petar.zece...@gmail.com 
mailto:petar.zece...@gmail.com wrote:


Hi,
I tried to find the last reply by Nick Chammas (that I received in the
digest) using the Nabble web interface, but I cannot find it
(perhaps he
didn't reply directly to the user list?). That's one example of
Nabble's
usability.

Anyhow, I wanted to add my two cents...

Apache user group could be frozen (not accepting new questions, if
that's
possible) and redirect users to Stack Overflow (automatic reply?). Old
questions remain (and are searchable) on Nabble, new questions go
to Stack
Exchange, so no need for migration. That's the idea, at least, as
I'm not
sure if that's technically doable... Is it?
dev mailing list could perhaps stay on Nabble (it's not that
busy), or have
a special tag on Stack Exchange.

Other thing, about new Stack Exchange site I proposed earlier. If
a new site
is created, there is no problem with guidelines, I think, because
Spark
community can apply different guidelines for the new site.

There is a FAQ about creating new sites:
http://area51.stackexchange.com/faq
It says: "Stack Exchange sites are free to create and free to use. All we
ask is that you have an enthusiastic, committed group of expert users who
check in regularly, asking and answering questions."
I think this requirement is satisfied...
Someone expressed a concern that they won't allow creating a
project-specific site, but there already exist some
project-specific sites,
like Tor, Drupal, Ubuntu...

Later, though, the FAQ also says:
If Y already exists, it already has a tag for X, and nobody is
complaining
(then you should not create a new 

Re: Discourse: A proposed alternative to the Spark User list

2015-01-22 Thread Petar Zecevic


But voting is done on the dev list, right? That could stay there...

Overlay might be a fine solution, too, but that still gives two user 
lists (SO and Nabble+overlay).



On 22.1.2015. 10:42, Sean Owen wrote:


Yes, there is some project business, like votes of record on releases, 
that needs to be carried on in a standard, simple, accessible place, and 
SO is not at all suitable.


Nobody is stuck with Nabble. The suggestion is to enable a different 
overlay on the existing list. SO remains a place you can ask questions 
too. So I agree with Nick's take.


BTW are there perhaps plans to split this mailing list into 
subproject-specific lists? That might also help tune in/out the subset 
of conversations of interest.


On Jan 22, 2015 10:30 AM, Petar Zecevic petar.zece...@gmail.com 
mailto:petar.zece...@gmail.com wrote:



Ok, thanks for the clarifications. I didn't know this list had to
remain the only official list.

Nabble is really not the best solution in the world, but we're
stuck with it, I guess.

That's it from me on this subject.

Petar


On 22.1.2015. 3:55, Nicholas Chammas wrote:


I think a few things need to be laid out clearly:

 1. This mailing list is the “official” user discussion platform.
That is, it is sponsored and managed by the ASF.
 2. Users are free to organize independent discussion platforms
focusing on Spark, and there is already one such platform in
Stack Overflow under the apache-spark and related tags.
Stack Overflow works quite well.
 3. The ASF will not agree to deprecating or migrating this user
list to a platform that they do not control.
 4. This mailing list has grown to an unwieldy size and
discussions are hard to find or follow; discussion tooling is
also lacking. We want to improve the utility and user
experience of this mailing list.
 5. We don’t want to fragment this “official” discussion community.
 6. Nabble is an independent product not affiliated with the ASF.
It offers a slightly better interface to the Apache mailing
list archives.

So to respond to some of your points, pzecevic:

Apache user group could be frozen (not accepting new
questions, if that’s possible) and redirect users to Stack
Overflow (automatic reply?).

From what I understand of the ASF’s policies, this is not
possible. :( This mailing list must remain the official Spark
user discussion platform.

Other thing, about new Stack Exchange site I proposed
earlier. If a new site is created, there is no problem with
guidelines, I think, because Spark community can apply
different guidelines for the new site.

I think Stack Overflow and the various Spark tags are working
fine. I don’t see a compelling need for a Stack Exchange
dedicated to Spark, either now or in the near future. Also, I
doubt a Spark-specific site can pass the 4 tests in the Area 51
FAQ http://area51.stackexchange.com/faq:

  * Almost all Spark questions are on-topic for Stack Overflow
  * Stack Overflow already exists, it already has a tag for
Spark, and nobody is complaining
  * You’re not creating such a big group that you don’t have
enough experts to answer all possible questions
  * There’s a high probability that users of Stack Overflow would
enjoy seeing the occasional question about Spark

I think complaining won’t be sufficient. :)

Someone expressed a concern that they won’t allow creating a
project-specific site, but there already exist some
project-specific sites, like Tor, Drupal, Ubuntu…

The communities for these projects are many, many times larger
than the Spark community is or likely ever will be, simply due to
the nature of the problems they are solving.

What we need is an improvement to this mailing list. We need
better tooling than Nabble to sit on top of the Apache archives,
and we also need some way to control the volume and quality of
mail on the list so that it remains a useful resource for the
majority of users.

Nick


On Wed Jan 21 2015 at 3:13:21 PM pzecevic
petar.zece...@gmail.com mailto:petar.zece...@gmail.com wrote:

Hi,
I tried to find the last reply by Nick Chammas (that I
received in the
digest) using the Nabble web interface, but I cannot find it
(perhaps he
didn't reply directly to the user list?). That's one example
of Nabble's
usability.

Anyhow, I wanted to add my two cents...

Apache user group could be frozen (not accepting new
questions, if that's
possible) and redirect users to Stack Overflow (automatic
reply?). Old
questions remain (and are searchable) on Nabble, new
questions go to Stack
Exchange, so no need