RE: Creating Partitioned Parquet Tables via SparkSQL

2015-04-01 Thread Felix Cheung
This is tracked by these JIRAs:
 
https://issues.apache.org/jira/browse/SPARK-5947
https://issues.apache.org/jira/browse/SPARK-5948
 
From: denny.g@gmail.com
Date: Wed, 1 Apr 2015 04:35:08 +
Subject: Creating Partitioned Parquet Tables via SparkSQL
To: user@spark.apache.org

Creating Parquet tables via .saveAsTable is great, but I was wondering if there 
is an equivalent way to create partitioned Parquet tables.
Thanks!
  

RE: Streaming anomaly detection using ARIMA

2015-04-01 Thread Felix Cheung
I'm curious - I'm not sure if I understand you correctly. With SparkR, the work 
is distributed in Spark and computed in R; isn't that what you are looking for?
SparkR was originally built on rJava for the R-JVM bridge but has moved away from it.
 
rJava has a component called JRI which allows the JVM to call R.
You could call R with JRI, or through rdd.forEachPartition(pass_data_to_R), or 
rdd.pipe
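
For the rdd.pipe route, the R side is just a script that reads records from stdin and writes results to stdout. A minimal sketch in R (the script name, field layout, and the mean() placeholder are hypothetical, not from this thread):

#!/usr/bin/env Rscript
# score.R -- hypothetical script a Spark job could invoke via rdd.pipe("Rscript score.R")
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  # each input line is one record, e.g. comma-separated metric values
  x <- as.numeric(strsplit(line, ",")[[1]])
  # placeholder "model": a simple mean; a real job would fit/apply ARIMA here
  cat(mean(x), "\n", sep = "")
}
close(con)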
 
From: cjno...@gmail.com
Date: Wed, 1 Apr 2015 19:31:48 -0400
Subject: Re: Streaming anomaly detection using ARIMA
To: user@spark.apache.org

Surprised I haven't gotten any responses about this. Has anyone tried using 
rJava or FastR w/ Spark? I've seen the SparkR project but that goes the other 
way: what I'd like to do is use R for model calculation and Spark to distribute 
the load across the cluster.
Also, has anyone used Scalation for ARIMA models? 
On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:
Taking out the complexity of the ARIMA models to simplify things- I can't seem 
to find a good way to represent even standard moving averages in spark 
streaming. Perhaps it's my ignorance with the micro-batched style of the 
DStreams API.

On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno...@gmail.com wrote:
I want to use ARIMA for a predictive model so that I can take time series data 
(metrics) and perform light anomaly detection. The time series data is going 
to be bucketed into different time units (several minutes within several hours, 
several hours within several days, several days within several years).
I want to do the algorithm in Spark Streaming. I'm used to tuple-at-a-time 
streaming and I'm having a tad bit of trouble gaining insight into how exactly 
the windows are managed inside of DStreams.
Let's say I have a simple dataset that is marked by a key/value tuple where the 
key is the name of the component whose metrics I want to run the algorithm 
against and the value is a metric (a value representing a sum for the time 
bucket). I want to create histograms of the time series data for each key in the 
windows in which they reside so I can use that histogram vector to generate my 
ARIMA prediction (actually, it seems like this doesn't just apply to ARIMA but 
could apply to any sliding average).
I *think* my prediction code may look something like this:
val predictionAverages = dstream
  .groupByKeyAndWindow(60*60*24, 60*60*24)
  .mapValues(applyARIMAFunction)
That is, keep 24 hours worth of metrics in each window and use that for the 
ARIMA prediction. The part I'm struggling with is how to join together the 
actual values so that I can do my comparison against the prediction model.

Let's say dstream contains the actual values. For any time window, I should be 
able to take a previous set of windows and use the model to compare against the 
current values.





  

RE: SparkR csv without headers

2015-08-21 Thread Felix Cheung
You could also rename them with names().
Unfortunately the API doc doesn't show an example of that:
https://spark.apache.org/docs/latest/api/R/index.html
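
A short sketch of what that could look like, assuming the names<- setter referred to above is available in your SparkR version (the column names are made up):

df <- read.df(sqlContext, "path/to/file.csv",
              source = "com.databricks.spark.csv", header = "false")
names(df) <- c("id", "name", "value")
# or rename one column at a time:
# df <- withColumnRenamed(df, "C1", "id")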





On Thu, Aug 20, 2015 at 7:43 PM -0700, Sun, Rui rui@intel.com wrote:
Hi,

You can create a DataFrame using read.df() with a specified schema.

Something like:
schema <- structType(structField("a", "string"), structField("b", "integer"), …)
read.df(…, schema = schema)

From: Franc Carter [mailto:franc.car...@rozettatech.com]
Sent: Wednesday, August 19, 2015 1:48 PM
To: user@spark.apache.org
Subject: SparkR csv without headers


Hi,

Does anyone have an example of how to create a DataFrame in SparkR which 
specifies the column names? The csv files I have do not have column names in 
the first row. I can read a csv nicely with 
com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 
etc.


thanks

--

Franc Carter | Systems Architect | RoZetta Technology






L4. 55 Harrington Street,  THE ROCKS,  NSW, 2000

PO Box H58, Australia Square, Sydney NSW, 1215, AUSTRALIA

T +61 2 8355 2515 | www.rozettatechnology.com






Re: SparkR in yarn-client mode needs sparkr.zip

2015-10-25 Thread Felix Cheung
This might be related to  https://issues.apache.org/jira/browse/SPARK-10500



On Sun, Oct 25, 2015 at 9:57 AM -0700, "Ted Yu"  wrote:
In zipRLibraries():

// create a zip file from scratch, do not append to existing file.
val zipFile = new File(dir, name)

I guess instead of creating sparkr.zip in the same directory as R lib, the
zip file can be created under some directory writable by the user launching
the app and accessible by user 'yarn'.

Cheers

On Sun, Oct 25, 2015 at 8:29 AM, Ram Venkatesh 
wrote:

> 
>
> If you run sparkR in yarn-client mode, it fails with
>
> Exception in thread "main" java.io.FileNotFoundException:
> /usr/hdp/2.3.2.1-12/spark/R/lib/sparkr.zip (Permission denied)
> at java.io.FileOutputStream.open0(Native Method)
> at java.io.FileOutputStream.open(FileOutputStream.java:270)
> at java.io.FileOutputStream.(FileOutputStream.java:213)
> at
>
> org.apache.spark.deploy.RPackageUtils$.zipRLibraries(RPackageUtils.scala:215)
> at
>
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:371)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> The behavior is the same when I use the pre-built spark-1.5.1-bin-hadoop2.6
> version also.
>
> Interestingly if I run as a user with write permissions to the R/lib
> directory, it succeeds. However, the sparkr.zip file is recreated each time
> sparkR is launched, so even if the file is present it has to be writable by
> the submitting user.
>
> A couple of questions:
> 1. Can sparkr.zip be packaged once and placed in that location for multiple
> users?
> 2. If not, is this location configurable, so that each user can specify a
> directory that they can write to?
>
> Thanks!
> Ram
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-yarn-client-mode-needs-sparkr-zip-tp25194.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Felix Cheung
Please open a JIRA?

 
Date: Mon, 26 Oct 2015 15:32:42 +0200
Subject: HiveContext ignores ("skip.header.line.count"="1")
From: daniel.ha...@veracity-group.com
To: user@spark.apache.org

Hi,

I have a csv table in Hive which is configured to skip the header row using 
TBLPROPERTIES("skip.header.line.count"="1"). When querying from Hive the header 
row is not included in the data, but when running the same query via 
HiveContext I get the header row.

I made sure that HiveContext sees the skip.header.line.count setting by running 
"show create table".

Any ideas?
Thank you.
Daniel

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Felix Cheung
+1 on that. It would be useful to use the model outside of Spark.
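
As a rough illustration of what using the model outside of Spark could look like: export the fitted coefficients once from the training job, then apply them with plain linear algebra. The sketch below is plain R with hypothetical values, not an API that Spark provides:

# hypothetical coefficients written out by the Spark ML training job
weights   <- c(0.42, -1.3, 2.05)
intercept <- 0.7

# plain-R prediction: a dot product plus the intercept
predict_linear <- function(features) sum(weights * features) + intercept

predict_linear(c(1.0, 0.5, 2.0))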



_
From: DB Tsai 
Sent: Wednesday, November 11, 2015 11:57 PM
Subject: Re: thought experiment: use spark ML to real time prediction
To: Nirmal Fernando 
Cc: Andy Davidson , Adrian Tanase 
, user @spark 


Do you think it would be useful to separate those models and the model 
loader/writer code into another spark-ml-common jar without any Spark platform 
dependencies, so users can load the models trained by Spark ML in their 
application and run the prediction?

Sincerely, 
 
DB Tsai 
-- 
Web:  https://www.dbtsai.com 
PGP Key ID: 0xAF08DF8D   
On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando wrote:
As of now, we are basically serializing the ML model and then deserializing it for prediction at real time.

On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase wrote:
I don't think this answers your question, but here's how you would evaluate the model in real time in a streaming app:
https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html

Maybe you can find a way to extract portions of MLlib and run them outside of Spark, loading the precomputed model and calling .predict on it...
-adrian
   From: Andy Davidson  
   
  Date: Tuesday, November 10, 2015 at 11:31 PM 
 To: "user @spark"
 Subject: thought experiment: use spark ML to real time prediction


Let's say I have used Spark ML to train a linear model. I know I can save and 
load the model to disk. I am not sure how I can use the model in a real-time 
environment. For example, I do not think I can return a "prediction" to the 
client using Spark Streaming easily. Also, for some applications the extra 
latency created by the batch process might not be acceptable.

If I was not using Spark, I would re-implement the model I trained in my batch 
environment in a language like Java and implement a REST service that uses the 
model to create a prediction and return the prediction to the client. Many 
models make predictions using linear algebra. Implementing predictions is 
relatively easy if you have a good vectorized LA package. Is there a way to use 
a model I trained using Spark ML outside of Spark?

As a motivating example, even if it's possible to return data to the client 
using Spark Streaming, I think the mini-batch latency would not be acceptable 
for a high-frequency stock trading system.

Kind regards

Andy

P.S. The examples I have seen so far use Spark Streaming to "preprocess" 
predictions. For example, a recommender system might use what current users are 
watching to calculate "trending recommendations". These are stored on disk and 
served up to users when they use the "movie guide". If a recommendation was a 
couple of minutes old it would not affect the end user's experience.


   

 
--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

Re: SparkR Error in sparkR.init(master=“local”) in RStudio

2015-10-06 Thread Felix Cheung
Is it possible that your user does not have permission to write temp file?
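
One quick way to check from that R session, using base R only (just a diagnostic sketch):

d <- tempdir()             # SparkR writes the backend_port file under here
print(d)
file.access(d, mode = 2)   # 0 means writable, -1 means not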






On Tue, Oct 6, 2015 at 10:26 AM -0700, "akhandeshi"  
wrote:
It seems it is failing at

  path <- tempfile(pattern = "backend_port")

I do not see the backend_port directory created...



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-Error-in-sparkR-init-master-local-in-RStudio-tp23768p24958.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SparkR: exported functions

2015-08-26 Thread Felix Cheung
I believe that is done explicitly while the final API is being figured out.
For the moment you could use DataFrame read.df()
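
For example (a sketch; the paths are placeholders): the exported DataFrame reader covers most file-reading needs, and non-exported functions can still be reached with ':::' if you accept that they are unsupported and may change between releases:

df <- read.df(sqlContext, "path/to/data.json", source = "json")

# unexported RDD-level function, reachable but not part of the public API:
rdd <- SparkR:::textFile(sc, "path/to/file.txt")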
 
 From: csgilles...@gmail.com
 Date: Tue, 25 Aug 2015 18:26:50 +0100
 Subject: SparkR: exported functions
 To: user@spark.apache.org
 
 Hi,
 
 I've just started playing about with SparkR (Spark 1.4.1), and noticed
 that a number of the functions haven't been exported. For example,
 the textFile function
 
 https://github.com/apache/spark/blob/master/R/pkg/R/context.R
 
 isn't exported, i.e. the function isn't in the NAMESPACE file. This is 
 obviously
 due to the ' missing in the roxygen2 directives.
 
 Is this intentional?
 
 Thanks
 
 Colin
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
  

RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-27 Thread Felix Cheung
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set: 
https://github.com/apache/spark/search?utf8=%E2%9C%93=PYTHONHASHSEED

 
Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I get an exception 'Randomness of hash of string 
should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not 
just set PYTHONHASHSEED?
Should I file a bug?
Kind regards
Andy
details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
The example does not work out of the box:

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")


The following script fixes the problem:

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done


  

Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Felix Cheung
Have you tried
flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv", 
source = "com.databricks.spark.csv", header = "true")    



_
From: Boyu Zhang 
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To:  


Hello everyone,

I tried to run the example data-manipulation.R, and can't get it to read the 
flights.csv file that is stored in my local fs. I don't want to store big files 
in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for 
me.

I tried the following:

flightsDF <- read.df(sqlContext, "file:///home/myuser/test_data/sparkR/flights.csv", source = "com.databricks.spark.csv", header = "true")

I got the message below and it eventually failed:
15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2 MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File file:/home/myuser/test_data/sparkR/flights.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:106)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Can someone please provide comments? Any tips are appreciated, thank you!

Boyu Zhang
   


  

RE: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-29 Thread Felix Cheung
Actually, upon closer look, PYTHONHASHSEED should be set (in the worker) when 
you create a SparkContext:
 
https://github.com/apache/spark/blob/master/python/pyspark/context.py#L166
 
And it should also be set from spark-submit or pyspark.
 
Can you check sys.version and os.environ.get("PYTHONHASHSEED")?
 
Date: Sun, 29 Nov 2015 09:48:19 -0800
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: felixcheun...@hotmail.com; yuzhih...@gmail.com
CC: user@spark.apache.org

Hi Felix and Ted
This is how I am starting spark
Should I file a bug?
Andy

export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  
$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*
From:  Felix Cheung <felixcheun...@hotmail.com>
Date:  Saturday, November 28, 2015 at 12:11 AM
To:  Ted Yu <yuzhih...@gmail.com>
Cc:  Andrew Davidson <a...@santacruzintegration.com>, "user @spark" 
<user@spark.apache.org>
Subject:  Re: possible bug spark/python/pyspark/rdd.py portable_hash()


Ah, it's there in spark-submit and pyspark. Seems like it should be added for 
spark_ec2.



_
From: Ted Yu <yuzhih...@gmail.com>
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark 
<user@spark.apache.org>


ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.
Andy reported this for an ec2 cluster.
I think a JIRA should be opened.

On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93=PYTHONHASHSEED

Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED?

Should I file a bug?
Kind regards
Andy

details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract

The example does not work out of the box:

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")

The following script fixes the problem:

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
 
 
  



  

Re: Python API Documentation Mismatch

2015-12-03 Thread Felix Cheung
Please open an issue in JIRA, thanks!






On Thu, Dec 3, 2015 at 3:03 AM -0800, "Roberto Pagliari" 
 wrote:





Hello,
I believe there is a mismatch between the API documentation (1.5.2) and the 
software currently available.

Not all functions mentioned here
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.recommendation

are, in fact, available. For example, the code below from the tutorial works:

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

While the alternative shown in the API documentation will not (it will complain 
that ALS takes no arguments; also, by inspecting the module with Python 
utilities I could not find several methods mentioned in the API docs):

>>> df = sqlContext.createDataFrame(
... [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 1, 1.0), (2, 
2, 5.0)],
... ["user", "item", "rating"])
>>> als = ALS(rank=10, maxIter=5)
>>> model = als.fit(df)


Thank you,



Re: SparkR in Spark 1.5.2 jsonFile Bug Found

2015-12-03 Thread Felix Cheung
It looks like this has been broken around Spark 1.5.
Please see JIRA SPARK-10185. This has been fixed in pyspark but unfortunately 
SparkR was missed. I have confirmed this is still broken in Spark 1.6.
Could you please open a JIRA?
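
In the meantime, one possible workaround (a sketch, assuming the individual directories share the same schema) is to read each path separately and union the results:

paths <- c("/path/to/dir1", "/path/to/dir2")
dfs   <- lapply(paths, function(p) jsonFile(sqlContext, p))
df    <- Reduce(unionAll, dfs)   # schemas must match across paths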






On Thu, Dec 3, 2015 at 2:08 PM -0800, "tomasr3" 
 wrote:





Hello,

I believe to have encountered a bug with Spark 1.5.2. I am using RStudio and
SparkR to read in JSON files with jsonFile(sqlContext, "path"). If "path" is
a single path (e.g., "/path/to/dir0"), then it works fine;

but, when "path" is a vector of paths (e.g.

path <- c("/path/to/dir1","/path/to/dir2"), then I get the following error
message:

> raw.terror<-jsonFile(sqlContext,path)
15/12/03 15:59:55 ERROR RBackendHandler: jsonFile on 1 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  java.io.IOException: No input paths specified in job
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2

Note that passing a vector of paths in Spark 1.4.1 works just fine. Any help
is greatly appreciated, in case this is not a bug but perhaps an environment or
other issue.

Best,
T



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-in-Spark-1-5-2-jsonFile-Bug-Found-tp25560.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: sparkR ORC support.

2016-01-06 Thread Felix Cheung
Yes, as Yanbo suggested, it looks like there is something wrong with the 
sqlContext.
Could you forward us your code please?






On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" <yblia...@gmail.com> wrote:





You should ensure your sqlContext is HiveContext.

sc <- sparkR.init()

sqlContext <- sparkRHive.init(sc)


2016-01-06 20:35 GMT+08:00 Sandeep Khurana <sand...@infoworks.io>:

> Felix
>
> I tried the option suggested by you.  It gave below error.  I am going to
> try the option suggested by Prem .
>
> Error in writeJobj(con, object) : invalid jobj 1
> 8
> stop("invalid jobj ", value$id)
> 7
> writeJobj(con, object)
> 6
> writeObject(con, a)
> 5
> writeArgs(rc, args)
> 4
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 3
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 2
> read.df(sqlContext, filepath, "orc") at
> spark_api.R#108
>
> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung <felixcheun...@hotmail.com>
> wrote:
>
>> Firstly I don't have ORC data to verify but this should work:
>>
>> df <- loadDF(sqlContext, "data/path", "orc")
>>
>> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
>> should be called after sparkR.init() - please check if there is any error
>> message there.
>>
>> _
>> From: Prem Sure <premsure...@gmail.com>
>> Sent: Tuesday, January 5, 2016 8:12 AM
>> Subject: Re: sparkR ORC support.
>> To: Sandeep Khurana <sand...@infoworks.io>
>> Cc: spark users <user@spark.apache.org>, Deepak Sharma <
>> deepakmc...@gmail.com>
>>
>>
>>
>> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>>
>>
>> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana <sand...@infoworks.io>
>> wrote:
>>
>>> Also, do I need to setup hive in spark as per the link
>>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>>> ?
>>>
>>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>>
>>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana <sand...@infoworks.io>
>>> wrote:
>>>
>>>> Deepak
>>>>
>>>> Tried this. Getting this error now
>>>>
>>>> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
>>>> unused argument ("")
>>>>
>>>>
>>>> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma <deepakmc...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Sandeep
>>>>> can you try this ?
>>>>>
>>>>> results <- sql(hivecontext, "FROM test SELECT id","")
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>>
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana <sand...@infoworks.io>
>>>>> wrote:
>>>>>
>>>>>> Thanks Deepak.
>>>>>>
>>>>>> I tried this as well. I created a hivecontext   with  "hivecontext
>>>>>> <<- sparkRHive.init(sc) "  .
>>>>>>
>>>>>> When I tried to read hive table from this ,
>>>>>>
>>>>>> results <- sql(hivecontext, "FROM test SELECT id")
>>>>>>
>>>>>> I get below error,
>>>>>>
>>>>>> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
>>>>>> SparkR was restarted, Spark operations need to be re-executed.
>>>>>>
>>>>>>
>>>>>> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma <deepakmc...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Sandeep
>>>>>>> I am not sure if ORC can be read directly in R.
>>>>>>> But there can be a workaround .First create hive table on top of ORC
>>>>>>> files and then access hive table in R.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>>
>>>>>>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana <
>>>>>>> sand...@infoworks.io> wrote:
>>>>>>>
>>>>>>>> Hello
>>>>>>>>
>>>>>>>> I need to read an ORC files in hdfs in R using spark. I am not able
>>>>>>>> to find a package to do that.
>>>>>>>>
>>>>>>>> Can anyone help with documentation or example for this purpose?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Architect
>>>>>>>> Infoworks.io <http://infoworks.io>
>>>>>>>> http://Infoworks.io
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks
>>>>>>> Deepak
>>>>>>> www.bigdatabig.com
>>>>>>> www.keosha.net
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Architect
>>>>>> Infoworks.io <http://infoworks.io>
>>>>>> http://Infoworks.io
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks
>>>>> Deepak
>>>>> www.bigdatabig.com
>>>>> www.keosha.net
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Architect
>>>> Infoworks.io <http://infoworks.io>
>>>> http://Infoworks.io
>>>>
>>>
>>>
>>>
>>> --
>>> Architect
>>> Infoworks.io <http://infoworks.io>
>>> http://Infoworks.io
>>>
>>
>>
>>
>>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>


Re: pyspark Dataframe and histogram through ggplot (python)

2016-01-05 Thread Felix Cheung
Hi,
select() returns a new Spark DataFrame; I would imagine ggplot would not work 
with it. Could you try df.select("something").toPandas()?


_
From: Snehotosh Banerjee 
Sent: Tuesday, January 5, 2016 4:32 AM
Subject: pyspark Dataframe and histogram through ggplot (python)
To:  


Hi,

I am facing an issue in rendering charts through ggplot while working on a 
pyspark DataFrame with a dummy dataset. I have created a Spark DataFrame and am 
trying to draw a histogram through ggplot in python.

I have a valid schema as below. But the below command is not working:

ggplot(df_no_null, aes(df_no_null.select('total_price_excluding_optional_support'))) + geom_histogram()

Appreciate input.

Regards,
Snehotosh


  

Re: sparkR ORC support.

2016-01-05 Thread Felix Cheung
Firstly I don't have ORC data to verify but this should work:
df <- loadDF(sqlContext, "data/path", "orc")
Secondly, could you check if sparkR.stop() was called? sparkRHive.init() should 
be called after sparkR.init() - please check if there is any error message 
there.


_
From: Prem Sure 
Sent: Tuesday, January 5, 2016 8:12 AM
Subject: Re: sparkR ORC support.
To: Sandeep Khurana 
Cc: spark users , Deepak Sharma 


Yes Sandeep, also copy hive-site.xml too to spark conf directory.

On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana wrote:
Also, do I need to setup hive in spark as per the link
http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ?
We might need to copy hdfs-site.xml file to spark conf directory?

On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana wrote:
Deepak

Tried this. Getting this error now:
Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") : unused argument ("")

On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma wrote:
Hi Sandeep
Can you try this?

results <- sql(hivecontext, "FROM test SELECT id", "")

Thanks
Deepak

On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana wrote:
Thanks Deepak.

I tried this as well. I created a hivecontext with "hivecontext <<- sparkRHive.init(sc)".
When I tried to read a hive table from this,

results <- sql(hivecontext, "FROM test SELECT id")

I get the below error:
Error in callJMethod(sqlContext, "sql", sqlQuery) : Invalid jobj 2. If SparkR was restarted, Spark operations need to be re-executed.

Not sure what is causing this? Any leads or ideas? I am using rstudio.

On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma wrote:
Hi Sandeep
I am not sure if ORC can be read directly in R.
But there can be a workaround. First create a hive table on top of the ORC files and then access the hive table in R.

Thanks
Deepak

On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana wrote:
 

Re: Do existing R packages work with SparkR data frames

2015-12-23 Thread Felix Cheung
Hi

SparkR has some support for machine learning algorithms, like glm.
For existing R packages, you currently need to collect() to convert into an R 
data.frame - assuming it fits into the memory of the driver node, though that 
would be required to work with an R package in any case.
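
A minimal sketch of that pattern, assuming the data fits on the driver (the bnlearn call is only an illustration):

local_df <- collect(spark_df)   # spark_df is a SparkR DataFrame
class(local_df)                 # "data.frame"

# library(bnlearn)
# net <- hc(local_df)           # hypothetical bnlearn call on the collected data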



_
From: Lan 
Sent: Tuesday, December 22, 2015 4:50 PM
Subject: Do existing R packages work with SparkR data frames
To:  


   Hello,   

 Is it possible for existing R Machine Learning packages (which work with R   
 data frames) such as bnlearn, to work with SparkR data frames? Or do I need   
 to convert SparkR data frames to R data frames? Is "collect" the function to   
 do the conversion, or how else to do that?   

 Many Thanks,   
 Lan   



 --   
 View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Do-existing-R-packages-work-with-SparkR-data-frames-tp25772.html
   
 Sent from the Apache Spark User List mailing list archive at Nabble.com.   

 -   
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org   
 For additional commands, e-mail: user-h...@spark.apache.org   

   


  

Re: number of executors in sparkR.init()

2015-12-25 Thread Felix Cheung
The equivalent for spark-submit --num-executors should be 
spark.executor.instances when used in SparkConf:
http://spark.apache.org/docs/latest/running-on-yarn.html
Could you try setting that with sparkR.init()?
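
Something like this, as a sketch for a yarn-client session started from R:

sc <- sparkR.init(master = "yarn-client",
                  sparkEnvir = list(spark.executor.instances = "6"))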


_
From: Franc Carter 
Sent: Friday, December 25, 2015 9:23 PM
Subject: number of executors in sparkR.init()
To:  


Hi,

I'm having trouble working out how to get the number of executors set when using sparkR.init().

If I start sparkR with

  sparkR --master yarn --num-executors 6

then I get 6 executors.

However, if I start sparkR with

  sparkR

followed by

  sc <- sparkR.init(master="yarn-client", sparkEnvir=list(spark.num.executors='6'))

then I only get 2 executors.

Can anyone point me in the direction of what I might be doing wrong? I need to initialise this way so that RStudio can hook in to SparkR.

thanks

--
Franc


  

Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance

2015-12-28 Thread Felix Cheung
Make sure you add the spark-csv package, as in this example, so that the source 
parameter in R's read.df works:

https://spark.apache.org/docs/latest/sparkr.html#from-data-sources
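
Putting it together, a sketch of the whole flow (the package version and paths are only examples; if your sparkR.init does not accept sparkPackages, set SPARKR_SUBMIT_ARGS as shown in the docs above, and cov() on a DataFrame is as in Yanbo's reply below):

sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.10:1.3.0")
sqlContext <- sparkRSQL.init(sc)

df <- read.df(sqlContext, "hdfs:///path/to/file.csv",
              source = "com.databricks.spark.csv",
              header = "true", inferSchema = "true")
covariance <- cov(df, "col1", "col2")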


_
From: Andy Davidson 
Sent: Monday, December 28, 2015 10:24 AM
Subject: Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance
To: zhangjp <592426...@qq.com>, Yanbo Liang 
Cc: user 


Hi Yanbo

I use spark-csv to load my data set. I work with both Java and Python. I would 
recommend you print the first couple of rows and also print the schema to make 
sure your data is loaded as you expect. You might find the following code 
example helpful. You may need to programmatically set the schema depending on 
what your data looks like.
   
   

public class LoadTidyDataFrame {

    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(file);
        return df;
    }
}
   
   
   From:  Yanbo Liang 
Date:  Monday, December 28, 2015 at 2:30 AM
To:  zhangjp <592426...@qq.com>
Cc:  "user @spark" 
Subject:  Re: how to use sparkR or spark MLlib load csv file on hdfs then 
calculate covariance
  
Load csv file:
df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv", header = "true")

Calculate covariance:
cov <- cov(df, "col1", "col2")

Cheers
Yanbo

2015-12-28 17:21 GMT+08:00 zhangjp <592426...@qq.com>:
Hi all,
I want to use sparkR or Spark MLlib to load a csv file on hdfs and then calculate covariance. How to do it?
Thanks.
 


  

Re: possible bug spark/python/pyspark/rdd.py portable_hash()

2015-11-28 Thread Felix Cheung
Ah, it's there in spark-submit and pyspark. Seems like it should be added for 
spark_ec2.



_
From: Ted Yu <yuzhih...@gmail.com>
Sent: Friday, November 27, 2015 11:50 AM
Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark 
<user@spark.apache.org>


ec2/spark-ec2 calls ./ec2/spark_ec2.py

I don't see PYTHONHASHSEED defined in any of these scripts.
Andy reported this for an ec2 cluster.
I think a JIRA should be opened.

On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
May I ask how you are starting Spark?
It looks like PYTHONHASHSEED is being set:
https://github.com/apache/spark/search?utf8=%E2%9C%93=PYTHONHASHSEED

Date: Thu, 26 Nov 2015 11:30:09 -0800
Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
From: a...@santacruzintegration.com
To: user@spark.apache.org

I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED?

Should I file a bug?
Kind regards
Andy

details
http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract

The example does not work out of the box:

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]

It raises:

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")

The following script fixes the problem:

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
 
 
  



  

RE: sparkR ORC support.

2016-01-12 Thread Felix Cheung
It looks like you have overwritten sc. Could you try this:

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")

 
Date: Tue, 12 Jan 2016 14:28:58 +0530
Subject: Re: sparkR ORC support.
From: sand...@infoworks.io
To: felixcheun...@hotmail.com
CC: yblia...@gmail.com; user@spark.apache.org; premsure...@gmail.com; 
deepakmc...@gmail.com

The code is very simple, pasted below. hive-site.xml is in spark conf already. I still see this error

Error in writeJobj(con, object) : invalid jobj 3

after running the script below.

===script===
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <<- sparkR.init()
sc <<- sparkRHive.init()
hivecontext <<- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")
#View(df)

On Wed, Jan 6, 2016 at 11:08 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:





Yes, as Yanbo suggested, it looks like there is something wrong with the 
sqlContext.



Could you forward us your code please?













On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang" 
<yblia...@gmail.com> wrote:





You should ensure your sqlContext is HiveContext.

sc <- sparkR.init()
sqlContext <- sparkRHive.init(sc)




2016-01-06 20:35 GMT+08:00 Sandeep Khurana 
<sand...@infoworks.io>:


Felix



I tried the option suggested by you.  It gave below error.  I am going to try 
the option suggested by Prem .





Error in writeJobj(con, object) : invalid jobj 1

8: stop("invalid jobj ", value$id)
7: writeJobj(con, object)
6: writeObject(con, a)
5: writeArgs(rc, args)
4: invokeJava(isStatic = TRUE, className, methodName, ...)
3: callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext, source, options)
2: read.df(sqlContext, filepath, "orc") at spark_api.R#108








On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:



Firstly I don't have ORC data to verify but this should work:



df <- loadDF(sqlContext, "data/path", "orc")



Secondly, could you check if sparkR.stop() was called? sparkRHive.init() should 
be called after sparkR.init() - please check if there is any error message 
there.






_

From: Prem Sure <premsure...@gmail.com>

Sent: Tuesday, January 5, 2016 8:12 AM

Subject: Re: sparkR ORC support.

To: Sandeep Khurana <sand...@infoworks.io>

Cc: spark users <user@spark.apache.org>, Deepak Sharma <deepakmc...@gmail.com>







Yes Sandeep, also copy hive-site.xml too to spark conf directory. 






On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
<sand...@infoworks.io> wrote:



Also, do I need to setup hive in spark as per the link  
http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark ?



We might need to copy hdfs-site.xml file to spark conf directory ? 





On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
<sand...@infoworks.io> wrote:



Deepak



Tried this. Getting this error now:

Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") : unused argument ("")






On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
<deepakmc...@gmail.com> wrote:




Hi Sandeep 
can you try this ? 




results <- sql(hivecontext, "FROM test SELECT id","") 



Thanks 

Deepak 









On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
<sand...@infoworks.io> wrote:



Thanks Deepak.



I tried this as well. I created a hivecontext   with  "hivecontext <<- 
sparkRHive.init(sc) "  .




When I tried to read hive table from this ,  



results <- sql(hivecontext, "FROM test SELECT id") 



I get below error,  




Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If SparkR 
was restarted, Spark operations need to be re-executed.


Not sure what is causing this? Any leads or ideas? I am using rstudio. 








On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
<deepakmc...@gmail.com> wrote:




Hi Sandeep 
I am not sure if ORC can be read directly in R. 
But there can be a workaround .First create hive table on top of ORC files and 
then access hive table in R.




Thanks 
Deepak 





On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana 
<sand...@infoworks.io> wrote:



Hello



I need to read an ORC files in hdfs in R using spark. I am not able to find a 
package to do that. 




Can anyone help with documentation or example for this purpos

Re: sparkR ORC support.

2016-01-12 Thread Felix Cheung
As you can see from my reply below from Jan 6, calling sparkR.stop() 
invalidates both the sc and hivecontext you have, and results in this invalid 
jobj error.

If you start R and run this, it should work:

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")

Is there a reason you want to call stop? If you do, you would need to call the 
line hivecontext <- sparkRHive.init(sc) again.



_
From: Sandeep Khurana <sand...@infoworks.io>
Sent: Tuesday, January 12, 2016 5:20 AM
Subject: Re: sparkR ORC support.
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: spark users <user@spark.apache.org>, Prem Sure <premsure...@gmail.com>, 
Deepak Sharma <deepakmc...@gmail.com>, Yanbo Liang <yblia...@gmail.com>


It worked for some time. Then I did sparkR.stop() and re-ran again to get the same error. Any idea why it ran fine before (while running fine it kept giving a warning about reusing the existing spark-context and that I should restart)? There is one more R script which instantiated Spark; I ran that too again.

On Tue, Jan 12, 2016 at 3:05 PM, Sandeep Khurana <sand...@infoworks.io> wrote:
Complete stacktrace is below. Can it be something with java versions?

stop("invalid jobj ", value$id)
8: writeJobj(con, object)
7: writeObject(con, a)
6: writeArgs(rc, args)
5: invokeJava(isStatic = TRUE, className, methodName, ...)
4: callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext, source, options)
3: read.df(sqlContext, path, source, schema, ...)
2: loadDF(hivecontext, filepath, "orc")

On Tue, Jan 12, 2016 at 2:41 PM, Sandeep Khurana <sand...@infoworks.io> wrote:
Running this gave

16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager
Error in writeJobj(con, object) : invalid jobj 3

How does it know which hive schema to connect to?

On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
It looks like you have overwritten sc. Could you try this:

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext, "/data/ingest/sparktest1/", "orc")

Date: Tue, 12 Jan 2016 14:28:58 +0530
Subject: Re: sparkR ORC sup

RE: GraphX Java API

2016-06-08 Thread Felix Cheung
You might want to check out GraphFrames
graphframes.github.io





On Sun, Jun 5, 2016 at 6:40 PM -0700, "Santoshakhilesh" 
 wrote:





OK, thanks for letting me know. Yes, since Java and Scala programs ultimately 
run on the JVM, the APIs written in one language can be called from the other.
When I used GraphX (around the beginning of 2015) the Java native APIs were not 
available for GraphX.
So I chose to develop my application in Scala, and it turned out much simpler to 
develop in Scala due to some of its powerful features like lambdas, map, 
filter etc., which were not available to me in Java 7.
Regards,
Santosh Akhilesh

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: 01 June 2016 00:56
To: Santoshakhilesh
Cc: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org; Golatkar, Jayesh 
(US - Bengaluru); Soni, Akhil Dharamprakash (US - Bengaluru); Matta, Rishul (US 
- Bengaluru); Aich, Risha (US - Bengaluru); Kumar, Rajinish (US - Bengaluru); 
Jain, Isha (US - Bengaluru); Kumar, Sandeep (US - Bengaluru)
Subject: Re: GraphX Java API

It's very much possible to use GraphX through Java, though some boilerplate may 
be needed. Here is an example.

Create a graph from edge and vertex RDDs (JavaRDD<Tuple2<Object, Long>> 
vertices, JavaRDD<Edge<Long>> edges):

ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
Graph<Long, Long> graph = Graph.apply(vertices.rdd(),
        edges.rdd(), 0L,
        StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
        longTag, longTag);



Then basically you can call graph.ops() and do available operations like 
triangleCounting etc,

Best Regards,
Sonal
Founder, Nube Technologies
Reifier at Strata Hadoop World
Reifier at Spark Summit 
2015




On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh 
> wrote:
Hi ,
Scala has a similar package structure to Java, and it ultimately runs on the 
JVM, so you probably get the impression that it's in Java.
As far as I know there is no Java API for GraphX. I used GraphX last year 
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) 
[mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,

I see some graphx packages listed here:
http://spark.apache.org/docs/latest/api/java/index.html

•   org.apache.spark.graphx
•   org.apache.spark.graphx.impl
•   org.apache.spark.graphx.lib
•   org.apache.spark.graphx.util

Aren't they meant to be used with Java?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) 
>; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX you need to 
switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no documentation 
available online on the usage or examples. It would be great if we could get 
some examples in Java.

Thanks and regards,

Abhishek Kumar
















RE: SparkContext SyntaxError: invalid syntax

2016-01-17 Thread Felix Cheung
Do you still need help on the PR?
btw, does this apply to YARN client mode?
 
From: andrewweiner2...@u.northwestern.edu
Date: Sun, 17 Jan 2016 17:00:39 -0600
Subject: Re: SparkContext SyntaxError: invalid syntax
To: cutl...@gmail.com
CC: user@spark.apache.org

Yeah, I do think it would be worth explicitly stating this in the docs.  I was 
going to try to edit the docs myself and submit a pull request, but I'm having 
trouble building the docs from github.  If anyone else wants to do this, here 
is approximately what I would say:
(To be added to http://spark.apache.org/docs/latest/configuration.html#environment-variables)

"Note: When running Spark on YARN in cluster mode, environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information."
I might take another crack at building the docs myself if nobody beats me to 
this.
Andrew

On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler  wrote:
Glad you got it going!  It's wasn't very obvious what needed to be set, maybe 
it is worth explicitly stating this in the docs since it seems to have come up 
a couple times before too.
Bryan
On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner 
 wrote:
Actually, I just found this [https://issues.apache.org/jira/browse/SPARK-1680], 
which after a bit of googling and reading leads me to believe that the 
preferred way to change the yarn environment is to edit the spark-defaults.conf 
file by adding this line:

spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python

While both this solution and the solution from my prior email work, I believe 
this is the preferred solution.
Sorry for the flurry of emails.  Again, thanks for all the help!
Andrew
On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner 
 wrote:
I finally got the pi.py example to run in yarn cluster mode. This was the key 
insight: https://issues.apache.org/jira/browse/SPARK-9229

I had to set SPARK_YARN_USER_ENV in spark-env.sh:

export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"

This caused the PYSPARK_PYTHON environment variable to be used in my yarn 
environment in cluster mode.
Thank you for all your help!
Best,Andrew


On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner 
 wrote:
I tried playing around with my environment variables, and here is an update.

When I run in cluster mode, my environment variables do not persist throughout the entire job. For example, I tried creating a local copy of HADOOP_CONF_DIR in /home//local/etc/hadoop/conf, and then, in spark-env.sh, I set the variable:

export HADOOP_CONF_DIR=/home//local/etc/hadoop/conf

Later, when we print the environment variables in the python code, I see this:

('HADOOP_CONF_DIR', '/etc/hadoop/conf')

However, when I run in client mode, I see this:

('HADOOP_CONF_DIR', '/home/awp066/local/etc/hadoop/conf')

Furthermore, if I omit that environment variable from spark-env.sh altogether, I get the expected error in both client and cluster mode:

When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

This suggests that my environment variables are being used when I first submit the job, but at some point during the job, my environment variables are thrown out and someone's (yarn's?) environment variables are being used.

Andrew
On Fri, Jan 15, 2016 at 11:03 AM, Andrew Weiner 
 wrote:
Indeed! Here is the output when I run in cluster mode:

Traceback (most recent call last):
  File "pi.py", line 22, in ?
    raise RuntimeError("\n"+str(sys.version_info) +"\n"+
RuntimeError:
(2, 4, 3, 'final', 0)
[('PYSPARK_GATEWAY_PORT', '48079'), ('PYTHONPATH',
'/scratch2/hadoop/yarn/local/usercache//filecache/116/spark-assembly-1.6.0-hadoop2.4.0.jar:/home//spark-1.6.0-bin-hadoop2.4/python:/home//code/libs:/scratch5/hadoop/yarn/local/usercache//appcache/application_1450370639491_0239/container_1450370639491_0239_01_01/pyspark.zip:/scratch5/hadoop/yarn/local/usercache//appcache/application_1450370639491_0239/container_1450370639491_0239_01_01/py4j-0.9-src.zip'),
('PYTHONUNBUFFERED', 'YES')]

As we suspected, it is using Python 2.4.

One thing that surprises me is that PYSPARK_PYTHON is not showing up in the 
list, even though I am setting it and exporting it in spark-submit and in 
spark-env.sh. Is there somewhere else I need to set this variable? Maybe in 
one of the hadoop conf files in my HADOOP_CONF_DIR?

Andrew

On Thu, Jan 14, 2016 at 1:14 PM, Bryan Cutler  wrote:
It seems like it could be the case that some other Python version is being 
invoked.  To make sure, can you add something like this to the top of the .py 

Re: NA value handling in sparkR

2016-01-27 Thread Felix Cheung
That's correct - and that is because spark-csv, as a Spark package, is not 
specifically aware of R's notion of NA and interprets it as a string value.
On the other hand, R's native NA is converted to NULL in Spark when creating a 
Spark DataFrame from an R data.frame.
https://eradiating.wordpress.com/2016/01/04/whats-new-in-sparkr-1-6-0/
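
A small sketch of the difference (assuming sqlContext is already set up; the column name is made up):

local_df <- data.frame(x = c(1, NA, 3))
sdf <- createDataFrame(sqlContext, local_df)   # R NA becomes NULL in Spark
count(dropna(sdf))                             # the NULL row is dropped -> 2

# With read.df + spark-csv, "NA" arrives as the literal string "NA",
# so dropna() will not remove it; filter it explicitly instead, e.g.
# filtered <- filter(csv_df, csv_df$x != "NA")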



_
From: Devesh Raj Singh 
Sent: Wednesday, January 27, 2016 3:19 AM
Subject: Re: NA value handling in sparkR
To: Deborah Siegel 
Cc:  


Hi,

While dealing with missing values with R and SparkR I observed the following. Please tell me if I am right or wrong?

Missing values in native R are represented with a logical constant, NA. SparkR DataFrames represent missing values with NULL. If you use createDataFrame() to turn a local R data.frame into a distributed SparkR DataFrame, SparkR will automatically convert NA to NULL.

However, if you are creating a SparkR DataFrame by reading in data from a file using read.df(), you may have strings of "NA", but not R logical constant NA missing value representations. The string "NA" is not automatically converted to NULL.
On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel wrote:
Maybe not ideal, but since read.df is inferring all columns from the csv containing "NA" as type of strings, one could filter them rather than using dropna().

filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
head(filtered_aq)

Perhaps it would be better to have an option for read.df to convert any "NA" it encounters into null types, like createDataFrame does for NA, and then one would be able to use dropna() etc.
 
  

On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh wrote:
Hi,

Yes you are right.
I think the problem is with reading of csv files. read.df is not considering NAs in the CSV file.
So what would be a workable solution in dealing with NAs in csv files?

On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel wrote:
Hi Devesh,

I'm not certain why that's happening, and it looks like it doesn't happen if you use createDataFrame directly:
aq <- createDataFrame(sqlContext, airquality)
head(dropna(aq, how="any"))

If I had to guess.. dropna(), I believe, drops null values. I suppose it's possible that createDataFrame converts R's NA values to null, so dropna() works with that. But perhaps read.df() does not convert R NAs to null, as those are most likely interpreted as strings when they come in from the csv. Just a guess, can anyone confirm?
Deb
   





  
On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh wrote:

Hi,

I have applied the following code on the airquality dataset available in R,
which has some missing values. I want to omit the rows which have NAs.

library(SparkR)
Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')

sc <- sparkR.init("local", sparkHome = "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")

sqlContext <- sparkRSQL.init(sc)

path <- "/Users/devesh/work/airquality/"

aq <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
              header="true", inferSchema="true")

Re: SparkR with Hive integration

2016-01-19 Thread Felix Cheung
You might need hive-site.xml



_
From: Peter Zhang 
Sent: Monday, January 18, 2016 9:08 PM
Subject: Re: SparkR with Hive integration
To: Jeff Zhang 
Cc:  


  Thanks,    
   I will try.   
   Peter 
  -- 
Google
Sent with Airmail
  

On January 19, 2016 at 12:44:46, Jeff Zhang (zjf...@gmail.com) wrote:   
  Please make sure you export environment variable 
HADOOP_CONF_DIR which contains the core-site.xml
On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang 
 wrote:
  Hi all,   
  
 
http://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes   
   From Hive tables   

You can also create SparkR DataFrames from Hive tables. To do this we will
need to create a HiveContext which can access tables in the Hive MetaStore.
Note that Spark should have been built with Hive support and more details on
the difference between SQLContext and HiveContext can be found in the SQL
programming guide.

# sc is an existing SparkContext.
hiveContext <- sparkRHive.init(sc)
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries can be expressed in HiveQL.
results <- sql(hiveContext, "FROM src SELECT key, value")

# results is now a DataFrame
head(results)
##   key   value
## 1 238 val_238
## 2  86  val_86
## 3 311 val_311
I use RStudio to run the above commands. When I run
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
I got an exception:

16/01/19 12:11:51 INFO FileUtils: Creating directory if it doesn't exist: file:/user/hive/warehouse/src
16/01/19 12:11:51 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one)

How do I use HDFS instead of the local file system (file:)? Which parameter
should I set?

Thanks a lot.

Peter Zhang
   -- 
 Google
 Sent with Airmail  
  


--   
Best Regards
 
 Jeff Zhang   


  

Re: SparkContext SyntaxError: invalid syntax

2016-01-19 Thread Felix Cheung

I have to run this to install the pre-reqs to get the jekyll build to work;
you do need the python pygments package (I'm on ubuntu):

sudo apt-get install ruby ruby-dev make gcc nodejs
sudo gem install jekyll --no-rdoc --no-ri
sudo gem install jekyll-redirect-from
sudo apt-get install python-Pygments
sudo apt-get install python-sphinx
sudo gem install pygments.rb

Hope that helps! If not, I can try putting together a doc change but I'd rather
you could make progress :)





On Mon, Jan 18, 2016 at 6:36 AM -0800, "Andrew Weiner" 
<andrewweiner2...@u.northwestern.edu> wrote:





Hi Felix,

Yeah, when I try to build the docs using jekyll build, I get a LoadError
(cannot load such file -- pygments) and I'm having trouble getting past it
at the moment.

From what I could tell, this does not apply to YARN in client mode.  I was
able to submit jobs in client mode and they would run fine without using
the appMasterEnv property.  I even confirmed that my environment variables
persisted during the job when run in client mode.  There is something about
YARN cluster mode that uses a different environment (the YARN Application
Master environment) and requires the appMasterEnv property for setting
environment variables.
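A quick way to confirm which interpreter the driver and the YARN containers actually run (a minimal sketch; the app name and partition count are arbitrary):

import sys
from pyspark import SparkContext

sc = SparkContext(appName="which-python")
print("driver: %s" % sys.version)
# Each task reports the interpreter version it sees; distinct() collapses
# identical executors into one entry.
executor_versions = (sc.parallelize(range(4), 4)
                     .map(lambda _: sys.version)
                     .distinct()
                     .collect())
print("executors: %s" % executor_versions)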

On Sun, Jan 17, 2016 at 11:37 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Do you still need help on the PR?
> btw, does this apply to YARN client mode?
>
> --
> From: andrewweiner2...@u.northwestern.edu
> Date: Sun, 17 Jan 2016 17:00:39 -0600
> Subject: Re: SparkContext SyntaxError: invalid syntax
> To: cutl...@gmail.com
> CC: user@spark.apache.org
>
>
> Yeah, I do think it would be worth explicitly stating this in the docs.  I
> was going to try to edit the docs myself and submit a pull request, but I'm
> having trouble building the docs from github.  If anyone else wants to do
> this, here is approximately what I would say:
>
> (To be added to
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> )
> "Note: When running Spark on YARN in cluster mode, environment variables
> need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName]
> property in your conf/spark-defaults.conf file.  Environment variables
> that are set in spark-env.sh will not be reflected in the YARN
> Application Master process in cluster mode.  See the YARN-related Spark
> Properties
> <http://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties>
> for more information."
>
> I might take another crack at building the docs myself if nobody beats me
> to this.
>
> Andrew
>
>
> On Fri, Jan 15, 2016 at 5:01 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
> Glad you got it going!  It's wasn't very obvious what needed to be set,
> maybe it is worth explicitly stating this in the docs since it seems to
> have come up a couple times before too.
>
> Bryan
>
> On Fri, Jan 15, 2016 at 12:33 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> Actually, I just found this [
> https://issues.apache.org/jira/browse/SPARK-1680], which after a bit of
> googling and reading leads me to believe that the preferred way to change
> the yarn environment is to edit the spark-defaults.conf file by adding this
> line:
> spark.yarn.appMasterEnv.PYSPARK_PYTHON  /path/to/python
>
> While both this solution and the solution from my prior email work, I
> believe this is the preferred solution.
>
> Sorry for the flurry of emails.  Again, thanks for all the help!
>
> Andrew
>
> On Fri, Jan 15, 2016 at 1:47 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I finally got the pi.py example to run in yarn cluster mode.  This was the
> key insight:
> https://issues.apache.org/jira/browse/SPARK-9229
>
> I had to set SPARK_YARN_USER_ENV in spark-env.sh:
> export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/aqualab/local/bin/python"
>
> This caused the PYSPARK_PYTHON environment variable to be used in my yarn
> environment in cluster mode.
>
> Thank you for all your help!
>
> Best,
> Andrew
>
>
>
> On Fri, Jan 15, 2016 at 12:57 PM, Andrew Weiner <
> andrewweiner2...@u.northwestern.edu> wrote:
>
> I tried playing around with my environment variables, and here is an
> update.
>
> When I run in cluster mode, my environment variables do not persist
> throughout the entire job.
> For example, I tried creating a local copy of HADOOP_CONF_DIR in
> /home//local/etc/hadoop/conf, and then, in spark-env.sh I set the
> variable:
> export HADOOP_CONF_DIR=/home//local/etc/hadoop/conf
>
> Later, when we print the environment variables in the python code, I see
> this:
>
> ('HADOOP_CONF_DIR', '/et

Re: cannot coerce class "data.frame" to a DataFrame - with spark R

2016-02-18 Thread Felix Cheung
Doesn't DESeqDataSetFromMatrix work with data.frame only? It wouldn't work with 
Spark's DataFrame - try collect(countMat) and others to convert them into 
data.frame?



_
From: roni 
Sent: Thursday, February 18, 2016 4:55 PM
Subject: cannot coerce class "data.frame" to a DataFrame - with spark R
To:  


Hi,

I am trying to convert a bioinformatics R script to use Spark R. It uses an
external Bioconductor package (DESeq2), so the only conversion really I have
made is to change the way it reads the input file.

When I call my external R library function in DESeq2 I get the error:
cannot coerce class "data.frame" to a DataFrame.

I am listing my old R code and new Spark R code below; the line giving the
problem is called out at the end.

ORIGINAL R -

library(plyr)
library(dplyr)
library(DESeq2)
library(pheatmap)
library(gplots)
library(RColorBrewer)
library(matrixStats)
library(pheatmap)
library(ggplot2)
library(hexbin)
library(corrplot)

sampleDictFile <- "/160208.txt"
sampleDict <- read.table(sampleDictFile)
peaks <- read.table("/Atlas.txt")
countMat <- read.table("/cntMatrix.txt", header = TRUE, sep = "\t")
colnames(countMat) <- sampleDict$sample
rownames(peaks) <- rownames(countMat) <- paste0(peaks$seqnames, ":", peaks$start, "-", peaks$end, "  ", peaks$symbol)
peaks$id <- rownames(peaks)

# SPARK R CODE

peaks <- (read.csv("/Atlas.txt", header = TRUE, sep = "\t"))
sampleDict <- (read.csv("/160208.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE))
countMat <- (read.csv("/cntMatrix.txt", header = TRUE, sep = "\t"))

---
COMMON CODE for both -

countMat <- countMat[, sampleDict$sample]
colData <- sampleDict[, "group", drop = FALSE]
design <- ~ group

dds <- DESeqDataSetFromMatrix(countData = countMat, colData = colData, design = design)

This line gives the error:

dds <- DESeqDataSetFromMatrix(countData = countMat, colData = (colData), design = design)
Error in DataFrame(colData, row.names = rownames(colData)) :
  cannot coerce class "data.frame" to a DataFrame

I tried as.data.frame or using DataFrame to wrap the defs, but no luck.
What can I do differently?

Thanks
Roni
   


  

Re: installing packages with pyspark

2016-03-19 Thread Felix Cheung
You are running pyspark in Spark client deploy mode. I have run into the same
error as well and I'm not sure if this is graphframes specific - the python
process can't find the graphframes Python code when it is loaded as a Spark
package.
To work around this, I extract the graphframes Python directory into a
directory called graphframes, local to where I run pyspark.
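A minimal sketch of that extraction step (the jar location under the local ivy cache is an assumption; the Python sources can equally be pulled from a jar downloaded off spark-packages.org):

import os
import zipfile

# Pull the graphframes/ Python sources out of the Spark-package jar into the
# current directory so that `import graphframes` resolves on the driver.
jar = os.path.expanduser(
    "~/.ivy2/jars/graphframes_graphframes-0.1.0-spark1.5.jar")  # assumed path
with zipfile.ZipFile(jar) as zf:
    for name in zf.namelist():
        if name.startswith("graphframes/") and name.endswith(".py"):
            zf.extract(name, ".")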






On Thu, Mar 17, 2016 at 10:11 PM -0700, "Franc Carter" <franc.car...@gmail.com> 
wrote:





I'm having trouble with that for pyspark, yarn and graphframes. I'm using:-

pyspark --master yarn --packages graphframes:graphframes:0.1.0-spark1.5

which starts and gives me a REPL, but when I try

   from graphframes import *

I get

  No module names graphframes

without '--master yarn' it works as expected

thanks


On 18 March 2016 at 12:59, Felix Cheung <felixcheun...@hotmail.com> wrote:

> For some, like graphframes that are Spark packages, you could also use
> --packages in the command line of spark-submit or pyspark. See
> http://spark.apache.org/docs/latest/submitting-applications.html
>
> _
> From: Jakob Odersky <ja...@odersky.com>
> Sent: Thursday, March 17, 2016 6:40 PM
> Subject: Re: installing packages with pyspark
> To: Ajinkya Kale <kaleajin...@gmail.com>
> Cc: <user@spark.apache.org>
>
>
>
> Hi,
> regarding 1, packages are resolved locally. That means that when you
> specify a package, spark-submit will resolve the dependencies and
> download any jars on the local machine, before shipping* them to the
> cluster. So, without a priori knowledge of dataproc clusters, it
> should be no different to specify packages.
>
> Unfortunatly I can't help with 2.
>
> --Jakob
>
> *shipping in this case means making them available via the network
>
> On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale <kaleajin...@gmail.com>
> wrote:
> > Hi all,
> >
> > I had couple of questions.
> > 1. Is there documentation on how to add the graphframes or any other
> package
> > for that matter on the google dataproc managed spark clusters ?
> >
> > 2. Is there a way to add a package to an existing pyspark context
> through a
> > jupyter notebook ?
> >
> > --aj
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>


--
Franc


Re: installing packages with pyspark

2016-03-19 Thread Felix Cheung
For some, like graphframes that are Spark packages, you could also use 
--packages in the command line of spark-submit or pyspark. 
See http://spark.apache.org/docs/latest/submitting-applications.html

_
From: Jakob Odersky 
Sent: Thursday, March 17, 2016 6:40 PM
Subject: Re: installing packages with pyspark
To: Ajinkya Kale 
Cc:  


   Hi,   
 regarding 1, packages are resolved locally. That means that when you   
 specify a package, spark-submit will resolve the dependencies and   
 download any jars on the local machine, before shipping* them to the   
 cluster. So, without a priori knowledge of dataproc clusters, it   
 should be no different to specify packages.   

 Unfortunatly I can't help with 2.   

 --Jakob   

 *shipping in this case means making them available via the network   

 On Thu, Mar 17, 2016 at 5:36 PM, Ajinkya Kale  wrote:   
 > Hi all,   
 >   
 > I had couple of questions.   
 > 1. Is there documentation on how to add the graphframes or any other package 
 >   
 > for that matter on the google dataproc managed spark clusters ?   
 >   
 > 2. Is there a way to add a package to an existing pyspark context through a  
 >  
 > jupyter notebook ?   
 >   
 > --aj   

 -   
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org   
 For additional commands, e-mail: user-h...@spark.apache.org   

   


  

Re: GraphFrames and IPython notebook issue - No module named graphframes

2016-04-30 Thread Felix Cheung
Please see
http://stackoverflow.com/questions/36397136/importing-pyspark-packages
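A hedged sketch of the approach in that answer (the package coordinates and master are assumptions): set PYSPARK_SUBMIT_ARGS before any SparkContext is created, so the Spark package - which bundles the graphframes Python code as well as the jar - is available to the driver; setting spark.jars alone is not enough for the Python side.

import os

# Must be set before the SparkContext exists.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.1.0-spark1.6 pyspark-shell")

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("graph_analytics").setMaster("local[4]")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

import graphframes  # resolves in local mode; cluster modes may still need the
                    # extraction workaround discussed elsewhere in this list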





On Mon, Apr 25, 2016 at 2:39 AM -0700, "Camelia Elena Ciolac" 
 wrote:





Hello,


I work locally on my laptop, not using DataBricks Community edition.


I downloaded  graphframes-0.1.0-spark1.6.jar from 
http://spark-packages.org/package/graphframes/graphframes

and placed it in a folder  named spark_extra_jars where I have other jars too.


After executing in a terminal:


ipython notebook --profile = nbserver



I open in the browser http://127.0.0.1:/ and in my IPython notebook I have, 
among others :


jar_path = 
'/home/camelia/spark_extra_jars/spark-csv_2.11-1.2.0.jar,/home/camelia/spark_extra_jars/commons-csv-1.2.jar,/home/camelia/spark_extra_jars/graphframes-0.1.0-spark1.6.jar,/home/camelia/spark_extra_jars/spark-mongodb_2.10-0.11.0.jar'


config = 
SparkConf().setAppName("graph_analytics").setMaster("local[4]").set("spark.jars",
 jar_path)

I can successfully import the other modules, but when I do

import graphframes

It gives the error:


ImportError   Traceback (most recent call last)
 in ()
> 1 import graphframes

ImportError: No module named graphframes



Thank you in advance for any hint on how to import graphframes successfully.

Best regards,
Camelia


Re: XLConnect in SparkR

2016-07-20 Thread Felix Cheung
From looking at the XLConnect package, its loadWorkbook() function only
supports reading from a local file path, so you might need a way to call an
HDFS command to get the file from HDFS first.

SparkR currently does not support this - you could read it in as a text file (I 
don't think .xlsx is a text format though), collect to get all the data at the 
driver, then save to local path perhaps?





On Wed, Jul 20, 2016 at 3:48 AM -0700, "Rabin Banerjee" 
> wrote:

Hi Yogesh ,

  I have never tried reading XLS files using Spark, but I think you can use
sc.wholeTextFiles to read the complete xls at once; as xls files are xml
internally, you need to read them all to parse them. Then I think you can use
Apache POI to read them.

Also, you can copy you XLS data to a MS-Access file to access via JDBC ,

Regards,
Rabin Banerjee

On Wed, Jul 20, 2016 at 2:12 PM, Yogesh Vyas 
> wrote:
Hi,

I am trying to load and read excel sheets from HDFS in sparkR using
XLConnect package.
Can anyone help me in finding out how to read xls files from HDFS in sparkR ?

Regards,
Yogesh

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org




Re: Graphframe Error

2016-07-05 Thread Felix Cheung
This could be the workaround:

http://stackoverflow.com/a/36419857




On Tue, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" 
<arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>> wrote:

Thanks Yanbo and Felix.

I tried these commands on CDH Quickstart VM and also on "Spark 1.6 pre-built 
for Hadoop" version.  I am still not able to get it working.  Not sure what I 
am missing.  Attaching the logs.




On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
It looks like either the extracted Python code is corrupted or there is a 
mismatch Python version. Are you using Python 3?


stackoverflow.com/questions/514371/whats-the-bad-magic-number-error<http://stackoverflow.com/questions/514371/whats-the-bad-magic-number-error>





On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" 
<yblia...@gmail.com<mailto:yblia...@gmail.com>> wrote:

Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically load the required graphframes jar file from maven 
repository, it was not affected by the location where the jar file was placed. 
Your examples works well in my laptop.

Or you can use try with


bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with graphframes enabled. You should set "--py-files" and 
"--jars" options with the directory where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel 
<arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>>:
I started my pyspark shell with command  (I am using spark 1.6).

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

I have copied 
http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
 to the lib directory of Spark as well.

I was getting below error

>>> from graphframes import *
Traceback (most recent call last):
  File "", line 1, in 
zipimport.ZipImportError: can't find module 'graphframes'
>>>

So, as per suggestions from similar questions, I have extracted the graphframes 
python directory and copied to the local directory where I am running pyspark.

>>> from graphframes import *

But, not able to create the GraphFrame

>>> g = GraphFrame(v, e)
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'GraphFrame' is not defined

Also, I am getting below error.
>>> from graphframes.examples import Graphs
Traceback (most recent call last):
  File "", line 1, in 
ImportError: Bad magic number in graphframes/examples.pyc

Any help will be highly appreciated.

- Arun




Re: Graphframe Error

2016-07-08 Thread Felix Cheung
I ran it with Python 2.





On Thu, Jul 7, 2016 at 4:13 AM -0700, "Arun Patel" 
<arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>> wrote:

I have tied this already.  It does not work.

What version of Python is needed for this package?

On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
This could be the workaround:

http://stackoverflow.com/a/36419857




On Tue, Jul 5, 2016 at 5:37 AM -0700, "Arun Patel" 
<arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>> wrote:

Thanks Yanbo and Felix.

I tried these commands on CDH Quickstart VM and also on "Spark 1.6 pre-built 
for Hadoop" version.  I am still not able to get it working.  Not sure what I 
am missing.  Attaching the logs.




On Mon, Jul 4, 2016 at 5:33 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
It looks like either the extracted Python code is corrupted or there is a 
mismatch Python version. Are you using Python 3?


stackoverflow.com/questions/514371/whats-the-bad-magic-number-error<http://stackoverflow.com/questions/514371/whats-the-bad-magic-number-error>





On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" 
<yblia...@gmail.com<mailto:yblia...@gmail.com>> wrote:

Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically load the required graphframes jar file from maven 
repository, it was not affected by the location where the jar file was placed. 
Your examples works well in my laptop.

Or you can use try with


bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with graphframes enabled. You should set "--py-files" and 
"--jars" options with the directory where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel 
<arunp.bigd...@gmail.com<mailto:arunp.bigd...@gmail.com>>:
I started my pyspark shell with command  (I am using spark 1.6).

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

I have copied 
http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
 to the lib directory of Spark as well.

I was getting below error

>>> from graphframes import *
Traceback (most recent call last):
  File "", line 1, in 
zipimport.ZipImportError: can't find module 'graphframes'
>>>

So, as per suggestions from similar questions, I have extracted the graphframes 
python directory and copied to the local directory where I am running pyspark.

>>> from graphframes import *

But, not able to create the GraphFrame

>>> g = GraphFrame(v, e)
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'GraphFrame' is not defined

Also, I am getting below error.
>>> from graphframes.examples import Graphs
Traceback (most recent call last):
  File "", line 1, in 
ImportError: Bad magic number in graphframes/examples.pyc

Any help will be highly appreciated.

- Arun





Re: SparkR error when repartition is called

2016-08-09 Thread Felix Cheung
I think it's saying a string isn't being sent properly from the JVM side.

Does it work for you if you change the dapply UDF to something simpler?

Do you have any log from YARN?


_
From: Shane Lee 
>
Sent: Tuesday, August 9, 2016 12:19 AM
Subject: Re: SparkR error when repartition is called
To: Sun Rui >
Cc: User >


Sun,

I am using spark in yarn client mode in a 2-node cluster with hadoop-2.7.2. My 
R version is 3.3.1.

I have the following in my spark-defaults.conf:
spark.executor.extraJavaOptions =-XX:+PrintGCDetails 
-XX:+HeapDumpOnOutOfMemoryError
spark.r.command=c:/R/R-3.3.1/bin/x64/Rscript
spark.ui.killEnabled=true
spark.executor.instances = 3
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.file.buffer = 1m
spark.driver.maxResultSize=0
spark.executor.memory=8g
spark.executor.cores = 6

I also ran into some other R errors that I was able to bypass by modifying the 
worker.R file (attached). In a nutshell I was getting the "argument is length 
of zero" error sporadically so I put in extra checks for it.

Thanks,

Shane

On Monday, August 8, 2016 11:53 PM, Sun Rui 
> wrote:


I can't reproduce your issue with len=1 in local mode.
Could you give more environment information?
On Aug 9, 2016, at 11:35, Shane Lee 
> wrote:

Hi All,

I am trying out SparkR 2.0 and have run into an issue with repartition.

Here is the R code (essentially a port of the pi-calculating scala example in 
the spark package) that can reproduce the behavior:

schema <- structType(structField("input", "integer"),
structField("output", "integer"))

library(magrittr)

len = 3000
data.frame(n = 1:len) %>%
as.DataFrame %>%
SparkR:::repartition(10L) %>%
dapply(., function (df)
{
library(plyr)
ddply(df, .(n), function (y)
{
data.frame(z =
{
x1 = runif(1) * 2 - 1
y1 = runif(1) * 2 - 1
z = x1 * x1 + y1 * y1
if (z < 1)
{
1L
}
else
{
0L
}
})
})
}
, schema
) %>%
SparkR:::summarize(total = sum(.$output)) %>% collect * 4 / len

For me it runs fine as long as len is less than 5000, otherwise it errors out 
with the following message:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in 
stage 56.0 failed 4 times, most recent failure: Lost task 6.3 in stage 56.0 
(TID 899, LARBGDV-VM02): org.apache.spark.SparkException: R computation failed 
with
 Error in readBin(con, raw(), stringLen, endian = "big") :
  invalid 'n' argument
Calls:  -> readBin
Execution halted
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
at 
org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:59)
at 
org.apache.spark.sql.execution.r.MapPartitionsRWrapper.apply(MapPartitionsRWrapper.scala:29)
at 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:178)
at 
org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:175)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$

If the repartition call is removed, it runs fine again, even with very large 
len.

After looking through the documentations and searching the web, I can't seem to 
find any clues how to fix this. Anybody has seen similary problem?

Thanks in advance for your help.

Shane








Re: Graphframe Error

2016-07-04 Thread Felix Cheung
It looks like either the extracted Python code is corrupted or there is a 
mismatch Python version. Are you using Python 3?


stackoverflow.com/questions/514371/whats-the-bad-magic-number-error





On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang" 
> wrote:

Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically load the required graphframes jar file from maven 
repository, it was not affected by the location where the jar file was placed. 
Your examples works well in my laptop.

Or you can use try with


bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with graphframes enabled. You should set "--py-files" and 
"--jars" options with the directory where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel 
>:
I started my pyspark shell with command  (I am using spark 1.6).

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

I have copied 
http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
 to the lib directory of Spark as well.

I was getting below error

>>> from graphframes import *
Traceback (most recent call last):
  File "", line 1, in 
zipimport.ZipImportError: can't find module 'graphframes'
>>>

So, as per suggestions from similar questions, I have extracted the graphframes 
python directory and copied to the local directory where I am running pyspark.

>>> from graphframes import *

But, not able to create the GraphFrame

>>> g = GraphFrame(v, e)
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'GraphFrame' is not defined

Also, I am getting below error.
>>> from graphframes.examples import Graphs
Traceback (most recent call last):
  File "", line 1, in 
ImportError: Bad magic number in graphframes/examples.pyc

Any help will be highly appreciated.

- Arun



Re: UDF in SparkR

2016-08-17 Thread Felix Cheung
This is supported in Spark 2.0.0 as dapply and gapply. Please see the API doc:
https://spark.apache.org/docs/2.0.0/api/R/

Feedback welcome and appreciated!


_
From: Yogesh Vyas >
Sent: Tuesday, August 16, 2016 11:39 PM
Subject: UDF in SparkR
To: user >


Hi,

Is there is any way of using UDF in SparkR ?

Regards,
Yogesh

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Examples in graphx

2017-01-29 Thread Felix Cheung
Which graph DB are you thinking about?
Here's one for neo4j

https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/


From: Deepak Sharma 
Sent: Sunday, January 29, 2017 4:28:19 AM
To: spark users
Subject: Examples in graphx

Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in graph based DB and then read it back in 
spark , process using graphx.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Getting exit code of pipe()

2017-02-12 Thread Felix Cheung
I mean that if you are running a script, then instead of exiting with a code it
could print out something.

Sounds like checkCode is what you want though.
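A minimal sketch of both options (run.sh and the RDD contents are placeholders; checkCode is the parameter documented on RDD.pipe in the API link above):

import os
from pyspark import SparkContext

sc = SparkContext(appName="pipe-exit-code")
rdd = sc.parallelize(["a", "b", "c"])

# Option 1: checkCode=True makes a non-zero exit status from run.sh fail the task.
out = rdd.pipe("./run.sh", env=dict(os.environ), checkCode=True).collect()

# Option 2: have run.sh end with something like `echo "EXIT:$?"` and pick the
# status lines out of the piped output instead.
statuses = [line for line in out if line.startswith("EXIT:")]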


_
From: Xuchen Yao <yaoxuc...@gmail.com<mailto:yaoxuc...@gmail.com>>
Sent: Sunday, February 12, 2017 8:33 AM
Subject: Re: Getting exit code of pipe()
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>


Cool that's exactly what I was looking for! Thanks!

How does one output the status into stdout? I mean, how does one capture the 
status output of pipe() command?

On Sat, Feb 11, 2017 at 9:50 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Do you want the job to fail if there is an error exit code?

You could set checkCode to True
spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe<http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe>

Otherwise maybe you want to output the status into stdout so you could process 
it individually.


_
From: Xuchen Yao <yaoxuc...@gmail.com<mailto:yaoxuc...@gmail.com>>
Sent: Friday, February 10, 2017 11:18 AM
Subject: Getting exit code of pipe()
To: <user@spark.apache.org<mailto:user@spark.apache.org>>



Hello Community,

I have the following Python code that calls an external command:

rdd.pipe('run.sh', env=os.environ).collect()

run.sh can either exit with status 1 or 0, how could I get the exit code from 
Python? Thanks!

Xuchen







Re: Getting exit code of pipe()

2017-02-11 Thread Felix Cheung
Do you want the job to fail if there is an error exit code?

You could set checkCode to True
spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe

Otherwise maybe you want to output the status into stdout so you could process 
it individually.


_
From: Xuchen Yao >
Sent: Friday, February 10, 2017 11:18 AM
Subject: Getting exit code of pipe()
To: >


Hello Community,

I have the following Python code that calls an external command:

rdd.pipe('run.sh', env=os.environ).collect()

run.sh can either exit with status 1 or 0, how could I get the exit code from 
Python? Thanks!

Xuchen




Re: what does dapply actually do?

2017-01-18 Thread Felix Cheung
With Spark, the processing is performed lazily. This means nothing much really
happens until you call an "action" - an example is collect(). Another way is to
write the output in a distributed manner - see write.df() in R.

With SparkR dapply() passing the data from Spark to R to process by your UDF 
could have significant overhead. Could you provide more information on your 
case?


_
From: Xiao Liu1 >
Sent: Wednesday, January 18, 2017 11:30 AM
Subject: what does dapply actually do?
To: >



Hi,
I'm really new and trying to learn sparkR. I have defined a relatively 
complicated user-defined function, and use dapply() to apply the function on a 
SparkDataFrame. It was very fast. But I am not sure what has actually been done 
by dapply(). Because when I used collect() to see the output, which is very 
simple, it took a long time to get the result. I suppose maybe I don't need to 
use collect(), but without using it, how can I output the final results, say, 
in a .csv file?
Thank you very much for the help.

Best Regards,
Xiao



From: Ninad Shringarpure >
To: user >
Date: 01/18/2017 02:24 PM
Subject: Creating UUID using SparksSQL





Hi Team,

Is there a standard way of generating a unique id for each row from Spark
SQL? I am looking for functionality similar to UUID generation in hive.

Let me know if you need any additional information.

Thanks,
Ninad






Re: Creating UUID using SparksSQL

2017-01-18 Thread Felix Cheung
spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id

?
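A hedged sketch of both routes (Spark 2.0+ API; the example DataFrame is arbitrary): monotonically_increasing_id() gives each row a unique, but not consecutive, 64-bit id, while a small Python UDF can produce true UUID strings if that exact format is required.

import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("row-ids").getOrCreate()
df = spark.range(5)

# Unique (not consecutive) numeric id per row.
df = df.withColumn("row_id", monotonically_increasing_id())

# True UUID strings via a Python UDF.
make_uuid = udf(lambda: str(uuid.uuid4()), StringType())
df = df.withColumn("uuid", make_uuid())
df.show(truncate=False)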



From: Ninad Shringarpure 
Sent: Wednesday, January 18, 2017 11:23:15 AM
To: user
Subject: Creating UUID using SparksSQL

Hi Team,

Is there a standard way of generating a unique id for each row from Spark
SQL? I am looking for functionality similar to UUID generation in hive.

Let me know if you need any additional information.

Thanks,
Ninad


Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Felix Cheung
I think lots of components expect to have read/write permission to a /tmp 
directory on HDFS.

Glad it works out!


_
From: Andy Davidson 
<a...@santacruzintegration.com<mailto:a...@santacruzintegration.com>>
Sent: Thursday, August 18, 2016 5:12 PM
Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: 
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
directory: /tmp tmp
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
user @spark <user@spark.apache.org<mailto:user@spark.apache.org>>



NICE CATCH!!! Many thanks.


I spent all day on this bug


The error msg report /tmp. I did not think to look on hdfs.


[ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/

Found 1 items

-rw-r--r--   3 ec2-user supergroup418 2016-04-13 22:49 hdfs:///tmp

[ec2-user@ip-172-31-22-140 notebooks]$


I have no idea how hdfs:///tmp got created. I deleted it.

This caused a bunch of exceptions. These exceptions had useful messages. I was
able to fix the problem as follows:

$ hadoop fs -rmr hdfs:///tmp

Now I run the notebook. It creates hdfs:///tmp/hive but the permissions are wrong:

$ hadoop fs -chmod 777 hdfs:///tmp/hive


From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Date: Thursday, August 18, 2016 at 3:37 PM
To: Andrew Davidson 
<a...@santacruzintegration.com<mailto:a...@santacruzintegration.com>>, "user 
@spark" <user@spark.apache.org<mailto:user@spark.apache.org>>
Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: 
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
directory: /tmp tmp

Do you have a file called tmp at / on HDFS?





On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson" 
<a...@santacruzintegration.com<mailto:a...@santacruzintegration.com>> wrote:

For unknown reason I can not create UDF when I run the attached notebook on my 
cluster. I get the following error


Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.: java.lang.RuntimeException: 
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
directory: /tmp tmp

The notebook runs fine on my Mac

In general I am able to run non UDF spark code with out any trouble

I start the notebook server as the user “ec2-user" and uses master URL
spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066



I found the following message in the notebook server log file. I have log level 
set to warn


16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0

16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException


The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2



#from pyspark.sql import SQLContext, HiveContext

#sqlContext = SQLContext(sc)

​

#from pyspark.sql import DataFrame

#from pyspark.sql import functions

​

from pyspark.sql.types import StringType

from pyspark.sql.functions import udf

​

print("spark version: {}".format(sc.version))

​

import sys

print("python version: {}".format(sys.version))

spark version: 1.6.1
python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]



# functions.lower() raises
# py4j.Py4JException: Method lower([class java.lang.String]) does not exist
# work around: define a UDF
toLowerUDFRetType = StringType()
#toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
toLowerUDF = udf(lambda s : s.lower(), StringType())

You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
assembly

Py4JJavaError Traceback (most recent call last)
 in ()
  4 toLowerUDFRetType = StringType()
  5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> 6 toLowerUDF = udf(lambda s : s.lower(), StringType())

/root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
   1595 [Row(slen=5), Row(slen=3)]
   1596 """
-> 1597 return UserDefinedFunction(f, returnType)
   1598
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559
   1560 def _create_judf(self, name):

/root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 na

Re: pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Felix Cheung
Do you have a file called tmp at / on HDFS?





On Thu, Aug 18, 2016 at 2:57 PM -0700, "Andy Davidson" 
> wrote:

For unknown reason I can not create UDF when I run the attached notebook on my 
cluster. I get the following error


Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: 
Parent path is not a directory: /tmp tmp

The notebook runs fine on my Mac

In general I am able to run non UDF spark code with out any trouble

I start the notebook server as the user “ec2-user" and uses master URL
spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066



I found the following message in the notebook server log file. I have log level 
set to warn


16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0

16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException


The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2



#from pyspark.sql import SQLContext, HiveContext

#sqlContext = SQLContext(sc)

​

#from pyspark.sql import DataFrame

#from pyspark.sql import functions

​

from pyspark.sql.types import StringType

from pyspark.sql.functions import udf

​

print("spark version: {}".format(sc.version))

​

import sys

print("python version: {}".format(sys.version))

spark version: 1.6.1
python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]



# functions.lower() raises
# py4j.Py4JException: Method lower([class java.lang.String]) does not exist
# work around define a UDF
toLowerUDFRetType = StringType()
#toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
toLowerUDF = udf(lambda s : s.lower(), StringType())


You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
assembly


Py4JJavaErrorTraceback (most recent call last)
 in ()
  4 toLowerUDFRetType = StringType()
  5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> 6 toLowerUDF = udf(lambda s : s.lower(), StringType())

/root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
   1595 [Row(slen=5), Row(slen=3)]
   1596 """
-> 1597 return UserDefinedFunction(f, returnType)
   1598
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, 
name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559
   1560 def _create_judf(self, name):

/root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 name = f.__name__ if hasattr(f, '__name__') else 
f.__class__.__name__

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
681 try:
682 if not hasattr(self, '_scala_HiveContext'):
--> 683 self._scala_HiveContext = self._get_hive_ctx()
684 return self._scala_HiveContext
685 except Py4JError as e:

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
690
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693
694 def refreshTable(self, tableName):

/root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, 
*args)
   1062 answer = self._gateway_client.send_command(command)
   1063 return_value = get_return_value(
-> 1064 answer, self._gateway_client, None, self._fqn)
   1065
   1066 for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: 
Parent path is not a directory: /tmp tmp
at 

Re: Best way to read XML data from RDD

2016-08-19 Thread Felix Cheung
Have you tried

https://github.com/databricks/spark-xml
?
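A hedged sketch of its basic use (the path and rowTag are assumptions, and it presumes pyspark was launched with the spark-xml package on the classpath, e.g. via --packages); note that, as discussed later in the thread, this reads XML files rather than XML strings already sitting in an RDD:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="xml-read")
sqlContext = SQLContext(sc)

# Each element named by rowTag becomes one row of the DataFrame.
df = (sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")
      .load("/path/to/data.xml"))
df.printSchema()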




On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" 
> wrote:

Hi,

There is an RDD with json data. I could read the json data using rdd.read.json. The
json data has XML data in a couple of key-value pairs.

Which is the best method to read and parse the XML from the rdd? Are there any
specific xml libraries for spark? Could anyone help on this.

Thanks.


Re: Best way to read XML data from RDD

2016-08-19 Thread Felix Cheung
Ah. Have you tried Jackson?
https://github.com/FasterXML/jackson-dataformat-xml/blob/master/README.md


_
From: Diwakar Dhanuskodi 
<diwakar.dhanusk...@gmail.com<mailto:diwakar.dhanusk...@gmail.com>>
Sent: Friday, August 19, 2016 9:41 PM
Subject: Re: Best way to read XML data from RDD
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
user <user@spark.apache.org<mailto:user@spark.apache.org>>


Yes. It accepts an xml file as the source, but not an RDD. The XML data embedded
inside the json is streamed from a kafka cluster, so I can get it as an RDD.
Right now I am using the spark.xml XML.loadstring method inside an RDD map
function, but performance-wise I am not happy: it takes 4 minutes to parse the
XML from 2 million messages in an environment of 3 nodes with 100G and 4 CPUs each.


Sent from Samsung Mobile.


---- Original message 
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Date:20/08/2016 09:49 (GMT+05:30)
To: Diwakar Dhanuskodi 
<diwakar.dhanusk...@gmail.com<mailto:diwakar.dhanusk...@gmail.com>>, user 
<user@spark.apache.org<mailto:user@spark.apache.org>>
Cc:
Subject: Re: Best way to read XML data from RDD

Have you tried

https://github.com/databricks/spark-xml
?




On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar 
Dhanuskodi"<diwakar.dhanusk...@gmail.com<mailto:diwakar.dhanusk...@gmail.com>> 
wrote:

Hi,

There is a RDD with json data. I could read json data using rdd.read.json . The 
json data has XML data in couple of key-value paris.

Which is the best method to read and parse XML from rdd. Is there any specific 
xml libraries for spark. Could anyone help on this.

Thanks.




Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
Hmm, <<- wouldn't work in cluster mode. Are you running spark in local mode?

In any case, I tried running your earlier code and it worked for me on a 250MB 
csv:

scoreModel <- function(parameters){
   library(data.table) # I assume this should be data.table
   dat <- data.frame(fread("file.csv"))
   score(dat,parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?

_
From: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>
Sent: Wednesday, August 24, 2016 10:37 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>, Felix 
Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
<user@spark.apache.org<mailto:user@spark.apache.org>>


Hi Spark experts,

I was able to get around the broadcast issue by using a global assignment '<<-'
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
   library(score)
   score(dat,parameters)
}

dat <<- read.csv('file.csv')
modelScores <- spark.lapply(parameterList, scoreModel)

From: Cinquegrana, Piero [mailto:piero.cinquegr...@neustar.biz]
Sent: Tuesday, August 23, 2016 2:39 PM
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

The output from score() is very small, just a float. The input, however, could 
be as big as several hundred MBs. I would like to broadcast the dataset to all 
executors.

Thanks,
Piero

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>;user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

How big is the output from score()?

Also could you elaborate on what you want to broadcast?


On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>> wrote:
Hello,

I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a 
complex function to be run across executors and I have to send the entire 
dataset, but there is not (that I could find) a way to broadcast the variable 
in SparkR. I am thus reading the dataset in each executor from disk, but I 
getting the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
   library(read.table)
   dat <- data.frame(fread("file.csv"))
   score(dat,parameters)
}

parameterList <- lapply(1:numModels, function(i) getParameters(i))

modelScores <- spark.lapply(parameterList, scoreModel)


Piero Cinquegrana
MarketShare: A Neustar Solution /Data Science

Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Felix Cheung
There is an Anaconda parcel one could readily install on CDH

https://docs.continuum.io/anaconda/cloudera

As Sean says it is Python 2.7.x.

Spark should work for both 2.7 and 3.5.

_
From: Sean Owen >
Sent: Friday, September 2, 2016 12:41 AM
Subject: Re: PySpark: preference for Python 2.7 or Python 3.5?
To: Ian Stokes Rees >
Cc: user @spark >


Spark should work fine with Python 3. I'm not a Python person, but all else 
equal I'd use 3.5 too. I assume the issue could be libraries you want that 
don't support Python 3. I don't think that changes with CDH. It includes a 
version of Anaconda from Continuum, but that lays down Python 2.7.11. I don't 
believe there's any particular position on 2 vs 3.

On Fri, Sep 2, 2016 at 3:56 AM, Ian Stokes Rees 
> wrote:
I have the option of running PySpark with Python 2.7 or Python 3.5.  I am 
fairly expert with Python and know the Python-side history of the differences.  
All else being the same, I have a preference for Python 3.5.  I'm using CDH 5.8 
and I'm wondering if that biases whether I should proceed with PySpark on top 
of Python 2.7 or 3.5.  Opinions?  Does Cloudera have an official (or 
unofficial) position on this?

Thanks,

Ian
___
Ian Stokes-Rees
Computational Scientist

617.942.0218





Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-10 Thread Felix Cheung
You should be able to get it to work with 2.0 as uber jar.

What type cluster you are running on? YARN? And what distribution?





On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" 
> wrote:

You really shouldn't mix different versions of Spark between the master and 
worker nodes, if your going to upgrade - upgrade all of them. Otherwise you may 
get very confusing failures.

On Monday, September 5, 2016, Rex X 
> wrote:
Wish to use the Pivot Table feature of data frame which is available since 
Spark 1.6. But the spark of current cluster is version 1.5. Can we install 
Spark 2.0 on the master node to work around this?

Thanks!


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau



Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
How are you calling dirs()? What would be x? Is dat a SparkDataFrame?

With SparkR, i in dat[i, 4] should be a logical expression for the row, e.g.
df[df$age %in% c(19, 30), 1:2]





On Sat, Sep 10, 2016 at 11:02 AM -0700, "Bene" 
> wrote:

Here are a few code snippets:

The data frame looks like this:

  kfz zeit                datum      latitude longitude
1   # 2015-02-09 07:18:33 2015-02-09 52.35234  9.881965
2   # 2015-02-09 07:18:34 2015-02-09 52.35233  9.881970
3   # 2015-02-09 07:18:35 2015-02-09 52.35232  9.881975
4   # 2015-02-09 07:18:36 2015-02-09 52.35232  9.881972
5   # 2015-02-09 07:18:37 2015-02-09 52.35231  9.881973
6   # 2015-02-09 07:18:38 2015-02-09 52.35231  9.881978

I call this function with a number (position in the data frame) and a data
frame:

dirs <- function(x, dat){
  direction(startLat = dat[x,4], endLat = dat[x+1,4], startLon = dat[x,5],
endLon = dat[x+1,5])
}

Here I get the error with the S4 class not subsettable. This function calls
another function which does the actual calculation:

direction <- function(startLat, endLat, startLon, endLon){
  startLat <- degrees.to.radians(startLat);
  startLon <- degrees.to.radians(startLon);
  endLat <- degrees.to.radians(endLat);
  endLon <- degrees.to.radians(endLon);
  dLon <- endLon - startLon;

  dPhi <- log(tan(endLat / 2 + pi / 4) / tan(startLat / 2 + pi / 4));
  if (abs(dLon) > pi) {
if (dLon > 0) {
  dLon <- -(2 * pi - dLon);
} else {
  dLon <- (2 * pi + dLon);
}
  }
  bearing <- radians.to.degrees((atan2(dLon, dPhi) + 360 )) %% 360;
  return (bearing);
}


Anything more you need?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-API-problem-with-subsetting-distributed-data-frame-tp27688p27691.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Assign values to existing column in SparkR

2016-09-10 Thread Felix Cheung
If you are to set a column to 0 (essentially remove and replace the existing 
one) you would need to put a column on the right hand side:


> df <- as.DataFrame(iris)
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
> df$Petal_Length <- 0
Error: class(value) == "Column" || is.null(value) is not TRUE
> df$Petal_Length <- lit(0)
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1 5.1 3.5 0 0.2 setosa
2 4.9 3.0 0 0.2 setosa
3 4.7 3.2 0 0.2 setosa
4 4.6 3.1 0 0.2 setosa
5 5.0 3.6 0 0.2 setosa
6 5.4 3.9 0 0.4 setosa

_
From: Deepak Sharma >
Sent: Friday, September 9, 2016 12:29 PM
Subject: Re: Assign values to existing column in SparkR
To: xingye >
Cc: >


Data frames are immutable in nature , so i don't think you can directly assign 
or change values on the column.

Thanks
Deepak

On Fri, Sep 9, 2016 at 10:59 PM, xingye 
> wrote:

I have some questions about assign values to a spark dataframe. I want to 
assign values to an existing column of a spark dataframe but if I assign the 
value directly, I got the following error.

df$c_mon <- 0
Error: class(value) == "Column" || is.null(value) is not TRUE

Is there a way to solve this?



--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net




Re: questions about using dapply

2016-09-10 Thread Felix Cheung
You might need MARGIN capitalized, this example works though:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c,
function(x) {apply(x[, paste("c", 1:2, sep = "")], MARGIN=2, FUN = function(y){ 
y %in% c(61, 99)})})
# dapplyCollect does not require the schema parameter


_
From: xingye >
Sent: Friday, September 9, 2016 10:35 AM
Subject: questions about using dapply
To: >



I have a question about using UDF in SparkR. I'm converting some R code into 
SparkR.


* The original R code is :

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = 
"%in%", c(61, 99))


* If I use dapply and put the original apply function as a function for dapply,

cols_in <-dapply(df,

function(x) {apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y){ 
y %in% c(61, 99)})},

schema )

The error shows Error in match.fun(FUN) : argument "FUN" is missing, with no 
default


* If I use spark.lapply, it still shows the error. It seems in spark, the 
column cr_cd1 is ambiguous.

cols_in <-spark.lapply(df[, paste("cr_cd", 1:12, sep = "")], function(x){ x 
%in% c(61, 99)})

 16/09/08 ERROR RBackendHandler: select on 3101 failed Error in 
invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous, could 
be: cr_cd1#2169L, cr_cd1#17787L.;



  *   If I use dapplyCollect, it works, but it will lead to memory issues if the
data is big. How can dapply work in my case?

wrapper = function(df){

out = apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", 
c(61, 99))

return(out)

}

cols_in <-dapplyCollect(df,wrapper)




Re: SparkR error: reference is ambiguous.

2016-09-10 Thread Felix Cheung
Could you provide more information on how df in your example is created?
Also please include the output from printSchema(df)?

This example works:
> c <- createDataFrame(cars)
> c
SparkDataFrame[speed:double, dist:double]
> c$speed <- c$dist*0
> c
SparkDataFrame[speed:double, dist:double]
> head(c)
  speed dist
1 0 2
2 0 10
3 0 4
4 0 22
5 0 16
6 0 10


_
From: Bedrytski Aliaksandr >
Sent: Friday, September 9, 2016 9:13 PM
Subject: Re: SparkR error: reference is ambiguous.
To: xingye >
Cc: >


Hi,

Can you use full-string queries in SparkR?
Like (in Scala):

df1.registerTempTable("df1")
df2.registerTempTable("df2")
val df3 = sparkContext.sql("SELECT * FROM df1 JOIN df2 ON df1.ra = df2.ra")

explicitly mentioning table names in the query often solves ambiguity problems.

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Fri, Sep 9, 2016, at 19:33, xingye wrote:

Not sure whether this is the right distribution list that I can ask questions. 
If not, can someone give a distribution list that can find someone to help?


I kept getting error of reference is ambiguous when implementing some sparkR 
code.


1. when i tried to assign values to a column using the existing column:

df$c_mon<- df$ra*0

  1.  16/09/09 15:11:28 ERROR RBackendHandler: col on 3101 failed
  2.  Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  3.org.apache.spark.sql.AnalysisException: Reference 'ra' is ambiguous, 
could be: ra#8146, ra#13501.;

2. when I joined two spark dataframes using the key:

df3<-join(df1, df2, df1$ra == df2$ra, "left")

  1.  16/09/09 14:48:07 WARN Column: Constructing trivially true equals 
predicate, 'ra#8146 = ra#8146'. Perhaps you need to use aliases.

Actually column "ra" is the column name, I don't know why sparkR keeps having 
errors about ra#8146 or ra#13501..

Can someone help?

Thanks





Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
Could you include code snippets you are running?





On Sat, Sep 10, 2016 at 1:44 AM -0700, "Bene" 
> wrote:

Hi,

I am having a problem with the SparkR API. I need to subset a distributed
data so I can extract single values from it on which I can then do
calculations.

Each row of my df has two integer values, I am creating a vector of new
values calculated as a series of sin, cos, tan functions on these two
values. Does anyone have an idea how to do this in SparkR?

So far I tried subsetting with [], [[]], subset(), but mostly I get the
error

object of type 'S4' is not subsettable

Is there any way to do such a thing in SparkR? Any help would be greatly
appreciated! Also let me know if you need more information, code etc.
Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-API-problem-with-subsetting-distributed-data-frame-tp27688.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Felix Cheung
Well, uber jar works in YARN, but not with standalone ;)





On Sun, Sep 18, 2016 at 12:44 PM -0700, "Chris Fregly" 
<ch...@fregly.com<mailto:ch...@fregly.com>> wrote:

you'll see errors like this...

"java.lang.RuntimeException: java.io.InvalidClassException: 
org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream 
classdesc serialVersionUID = -2221986757032131007, local class serialVersionUID 
= -5447855329526097695"

...when mixing versions of spark.

i'm actually seeing this right now while testing across Spark 1.6.1 and Spark 
2.0.1 for my all-in-one, hybrid cloud/on-premise Spark + Zeppelin + Kafka + 
Kubernetes + Docker + One-Click Spark ML Model Production Deployments 
initiative documented here:

https://github.com/fluxcapacitor/pipeline/wiki/Kubernetes-Docker-Spark-ML

and check out my upcoming meetup on this effort either in-person or online:

http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/233978839/

we're throwing in some GPU/CUDA just to sweeten the offering!  :)

On Sat, Sep 10, 2016 at 2:57 PM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I don't think a 2.0 uber jar will play nicely on a 1.5 standalone cluster.


On Saturday, September 10, 2016, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
You should be able to get it to work with 2.0 as uber jar.

What type cluster you are running on? YARN? And what distribution?





On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <hol...@pigscanfly.ca> 
wrote:

You really shouldn't mix different versions of Spark between the master and 
worker nodes, if your going to upgrade - upgrade all of them. Otherwise you may 
get very confusing failures.

On Monday, September 5, 2016, Rex X <dnsr...@gmail.com> wrote:
Wish to use the Pivot Table feature of data frame which is available since 
Spark 1.6. But the spark of current cluster is version 1.5. Can we install 
Spark 2.0 on the master node to work around this?

Thanks!


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau



--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




--
Chris Fregly
Research Scientist @ PipelineIO<http://pipeline.io>
Advanced Spark and TensorFlow 
Meetup<http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/>
San Francisco | Chicago | Washington DC






Re: No SparkR on Mesos?

2016-09-07 Thread Felix Cheung
This is correct - SparkR is not quite working completely on Mesos. JIRAs and 
contributions welcome!





On Wed, Sep 7, 2016 at 10:21 AM -0700, "Michael Gummelt" 
> wrote:

Quite possibly.  I've never used it.  I know Python was "unsupported" for a 
while, which turned out to mean there was a silly conditional that would fail 
the submission, even though all the support was there.  Could be the same for 
R.  Can you submit a JIRA?

On Wed, Sep 7, 2016 at 5:02 AM, Peter Griessl 
> wrote:
Hello,

does SparkR really not work (yet?) on Mesos (Spark 2.0 on Mesos 1.0)?

$ /opt/spark/bin/sparkR

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Launching java with spark-submit command /opt/spark/bin/spark-submit   
"sparkr-shell" /tmp/RtmpPYVJxF/backend_port338581f434
Error: SparkR is not supported for Mesos cluster.
Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  :
  JVM is not ready after 10 seconds


I couldn't find any information on this subject in the docs - am I missing 
something?

Thanks for any hints,
Peter



--
Michael Gummelt
Software Engineer
Mesosphere


Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-25 Thread Felix Cheung
The reason your second example works is because of a closure capture behavior. 
It should be ok for a small amount of data.

You could also use SparkR:::broadcast but please keep in mind that is not 
public API we actively support.

Thank you for the information on formula - I will test that out. Please note 
that SparkR code is now at

https://github.com/apache/spark/tree/master/R
_
From: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>
Sent: Thursday, August 25, 2016 6:08 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: <user@spark.apache.org<mailto:user@spark.apache.org>>, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>


I tested both in local and cluster mode and the ‘<<-‘ seemed to work at least 
for small data. Or am I missing something? Is there a way for me to test? If 
that does not work, can I use something like this?

sc <- SparkR:::getSparkContext()
bcStack <- SparkR:::broadcast(sc,stack)

I realized that the error: Error in writeBin(batch, con, endian = "big")

Was due to an object within the ‘parameters’ list which was a R formula.

When the spark.lapply method calls the parallelize method, it splits the list 
and calls the SparkR:::writeRaw method, which tries to convert from formula to 
binary exploding the size of the object being passed.

https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/R/serialize.R

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Thursday, August 25, 2016 2:35 PM
To: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

Hmm <<-- wouldn't work in cluster mode. Are you running spark in local mode?

In any case, I tried running your earlier code and it worked for me on a 250MB 
csv:

scoreModel <- function(parameters){
   library(data.table) # I assume this should data.table
   dat <- data.frame(fread(“file.csv”))
   score(dat,parameters)
}
parameterList <- lapply(1:100, function(i) getParameters(i))
modelScores <- spark.lapply(parameterList, scoreModel)

Could you provide more information on your actual code?

_
From: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>
Sent: Wednesday, August 24, 2016 10:37 AM
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")
To: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>, Felix 
Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>, 
<user@spark.apache.org<mailto:user@spark.apache.org>>



Hi Spark experts,

I was able to get around the broadcast issue by using a global assignment ‘<<-‘ 
instead of reading the data locally. However, I still get the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector


Pseudo code:

scoreModel <- function(parameters){
   library(score)
   score(dat,parameters)
}

dat <<- read.csv(‘file.csv’)
modelScores <- spark.lapply(parameterList, scoreModel)

From: Cinquegrana, Piero [mailto:piero.cinquegr...@neustar.biz]
Sent: Tuesday, August 23, 2016 2:39 PM
To: Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>;user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

The output from score() is very small, just a float. The input, however, could 
be as big as several hundred MBs. I would like to broadcast the dataset to all 
executors.

Thanks,
Piero

From: Felix Cheung [mailto:felixcheun...@hotmail.com]
Sent: Monday, August 22, 2016 10:48 PM
To: Cinquegrana, Piero 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>>;user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = 
"big")

How big is the output from score()?

Also could you elaborate on what you want to broadcast?


On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" 
<piero.cinquegr...@neustar.biz<mailto:piero.cinquegr...@neustar.biz>> wrote:
Hello,

I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a 
complex function to be run across executors and I have to send the entire 
dataset, but there is not (that I could find) a way to broadcast the variable 
in SparkR. I am thus reading the dataset in each executor from disk, but I 
getting t

Re: Disable logger in SparkR

2016-08-22 Thread Felix Cheung
You should be able to do that with log4j.properties
http://spark.apache.org/docs/latest/configuration.html#configuring-logging

Or programmatically
https://spark.apache.org/docs/2.0.0/api/R/setLogLevel.html
_
From: Yogesh Vyas >
Sent: Monday, August 22, 2016 6:12 AM
Subject: Disable logger in SparkR
To: user >


Hi,

Is there any way of disabling the logging on console in SparkR ?

Regards,
Yogesh

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: spark.lapply in SparkR: Error in writeBin(batch, con, endian = "big")

2016-08-22 Thread Felix Cheung
How big is the output from score()?

Also could you elaborate on what you want to broadcast?





On Mon, Aug 22, 2016 at 11:58 AM -0700, "Cinquegrana, Piero" 
> wrote:

Hello,

I am using the new R API in SparkR spark.lapply (spark 2.0). I am defining a 
complex function to be run across executors and I have to send the entire 
dataset, but there is not (that I could find) a way to broadcast the variable 
in SparkR. I am thus reading the dataset in each executor from disk, but I 
getting the following error:

Error in writeBin(batch, con, endian = "big") :
  attempting to add too many elements to raw vector

Any idea why this is happening?

Pseudo code:

scoreModel <- function(parameters){
   library(read.table)
   dat <- data.frame(fread("file.csv"))
   score(dat,parameters)
}

parameterList <- lapply(1:numModels, function(i) getParameters(i))

modelScores <- spark.lapply(parameterList, scoreModel)


Piero Cinquegrana
MarketShare: A Neustar Solution / Data Science
Mobile: +39.329.17.62.539 / www.neustar.biz
Reduce your environmental footprint. Print only if necessary.
Follow Neustar:   [New%20Picture]  
Facebook   
[New%20Picture%20(1)(1)]  
LinkedIn
   [New%20Picture%20(2)]  Twitter
The information contained in this email message is intended only for the use of 
the recipient(s) named above and may contain confidential and/or privileged 
information. If you are not the intended recipient you have received this email 
message in error and any review, dissemination, distribution, or copying of 
this message is strictly prohibited. If you have received this communication in 
error, please notify us immediately and delete the original message.



Piero Cinquegrana
MarketShare: A Neustar Solution / Data Science
Mobile: +39.329.17.62.539 / www.neustar.biz
Reduce your environmental footprint. Print only if necessary.
Follow Neustar:   [New%20Picture]  
Facebook   
[New%20Picture%20(1)(1)]  
LinkedIn
   [New%20Picture%20(2)]  Twitter
The information contained in this email message is intended only for the use of 
the recipient(s) named above and may contain confidential and/or privileged 
information. If you are not the intended recipient you have received this email 
message in error and any review, dissemination, distribution, or copying of 
this message is strictly prohibited. If you have received this communication in 
error, please notify us immediately and delete the original message.



Re: Issue Running sparkR on YARN

2016-11-09 Thread Felix Cheung
It maybe the Spark executor is running as a different user and it can't see 
where RScript is?

You might want to try putting Rscript path to PATH.

Also please see this for the config property to set for the R command to use:
https://spark.apache.org/docs/latest/configuration.html#sparkr



_
From: ian.malo...@tdameritrade.com
Sent: Wednesday, November 9, 2016 12:12 PM
Subject: Issue Running sparkR on YARN
To: >


Hi,

I'm trying to run sparkR (1.5.2) on YARN and I get:

java.io.IOException: Cannot run program "Rscript": error=2, No such file or 
directory

This strikes me as odd, because I can go to each node and various users and 
type Rscript and it works. I've done this on each node and spark-env.sh as 
well: export R_HOME=/path/to/R

This is how I'm setting it on the nodes (/etc/profile.d/path_edit.sh):

export R_HOME=/app/hdp_app/anaconda/bin/R
PATH=$PATH:/app/hdp_app/anaconda/bin

Any ideas?

Thanks,

Ian

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Strongly Connected Components

2016-11-10 Thread Felix Cheung
It is possible it is dead. Could you check the Spark UI to see if there is any 
progress?


_
From: Shreya Agarwal >
Sent: Thursday, November 10, 2016 12:45 AM
Subject: RE: Strongly Connected Components
To: >


Bump. Anyone? Its been running for 10 hours now. No results.

From: Shreya Agarwal
Sent: Tuesday, November 8, 2016 9:05 PM
To: user@spark.apache.org
Subject: Strongly Connected Components

Hi,

I am running this on a graph with >5B edges and >3B edges and have 2 questions -


  1.  What is the optimal number of iterations?
  2.  I am running it for 1 iteration right now on a beefy 100 node cluster, 
with 300 executors each having 30GB RAM and 5 cores. I have persisted the graph 
to MEMORY_AND_DISK. And it has been running for 3 hours already. Any ideas on 
how to speed this up?

Regards,
Shreya




Re: Substitute Certain Rows a data Frame using SparkR

2016-10-19 Thread Felix Cheung
It's a bit less concise but this works:

> a <- as.DataFrame(cars)
> head(a)
  speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10

> b <- withColumn(a, "speed", ifelse(a$speed > 15, a$speed, 3))
> head(b)
  speed dist
1 3 2
2 3 10
3 3 4
4 3 22
5 3 16
6 3 10

I think your example could be something we support though. Please feel free to 
open a JIRA for that.
_
From: shilp >
Sent: Monday, October 17, 2016 7:38 AM
Subject: Substitute Certain Rows a data Frame using SparkR
To: >


I have a sparkR Data frame and I want to Replace certain Rows of a Column which 
satisfy certain condition with some value.If it was a simple R data frame then 
I would do something as follows:df$Column1[df$Column1 == "Value"] = "NewValue" 
How would i perform similar operation on a SparkR data frame. ??

View this message in context: Substitute Certain Rows a data Frame using 
SparkR
Sent from the Apache Spark User List mailing list 
archive at 
Nabble.com.




Re: How to propagate R_LIBS to sparkr executors

2016-11-17 Thread Felix Cheung
Have you tried
spark.executorEnv.R_LIBS?

spark.apache.org/docs/latest/configuration.html#runtime-environment

_
From: Rodrick Brown >
Sent: Wednesday, November 16, 2016 1:01 PM
Subject: How to propagate R_LIBS to sparkr executors
To: >


I'm having an issue with a R module not getting picked up on the slave nodes in 
mesos. I have the following environment value R_LIBS set and for some reason 
this environment is only set in the driver context and not the executor is 
their a way to pass environment values down the executor level in sparkr?

I'm using Mesos 1.0.1 and Spark 2.0.1

Thanks.


--

[Orchard Platform]

Rodrick Brown / Site Reliability Engineer
+1 917 445 6839 / 
rodr...@orchardplatform.com

Orchard Platform
101 5th Avenue, 4th Floor, New York, NY 10003
http://www.orchardplatform.com

Orchard Blog | Marketplace Lending 
Meetup


NOTICE TO RECIPIENTS: This communication is confidential and intended for the 
use of the addressee only. If you are not an intended recipient of this 
communication, please delete it immediately and notify the sender by return 
email. Unauthorized reading, dissemination, distribution or copying of this 
communication is prohibited. This communication does not constitute an offer to 
sell or a solicitation of an indication of interest to purchase any loan, 
security or any other financial product or instrument, nor is it an offer to 
sell or a solicitation of an indication of interest to purchase any products or 
services to any persons who are prohibited from receiving such information 
under applicable law. The contents of this communication may not be accurate or 
complete and are subject to change without notice. As such, Orchard App, Inc. 
(including its subsidiaries and affiliates, "Orchard") makes no representation 
regarding the accuracy or completeness of the information contained herein. The 
intended recipient is advised to consult its own professional advisors, 
including those specializing in legal, tax and accounting matters. Orchard does 
not provide legal, tax or accounting advice.




Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread Felix Cheung
What is the format?



From: KhajaAsmath Mohammed 
Sent: Thursday, December 15, 2016 7:54:27 PM
To: user @spark
Subject: Spark Dataframe: Save to hdfs is taking long time

Hi,

I am using issue while saving the dataframe back to HDFS. It's taking long time 
to run.


val results_dataframe = sqlContext.sql("select gt.*,ct.* from PredictTempTable 
pt,ClusterTempTable ct,GamificationTempTable gt where gt.vin=pt.vin and 
pt.cluster=ct.cluster")
results_dataframe.coalesce(numPartitions)
results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)

dataFrame.write.mode(saveMode).format(format)
  .option(Codec, compressCodec) //"org.apache.hadoop.io.compress.snappyCodec"
  .save(outputPath)

It was taking long time and total number of records for  this dataframe is 
4903764

I even increased number of partitions from 10 to 20, still no luck. Can anyone 
help me in resolving this performance issue

Thanks,

Asmath


Re: How to load edge with properties file useing GraphX

2016-12-15 Thread Felix Cheung
Have you checked out https://github.com/graphframes/graphframes?

It might be easier to work with DataFrame.



From: zjp_j...@163.com 
Sent: Thursday, December 15, 2016 7:23:57 PM
To: user
Subject: How to load edge with properties file useing GraphX

Hi,
   I want to load a edge file  and vertex attriInfos file as follow ,how can i 
use these two files create Graph ?
  edge file -> "SrcId,DestId,propertis...  "   vertex attriInfos file-> "VID, 
properties..."

   I learned about there have a GraphLoader object  that can load edge file 
with no properties  and then join Vertex properties to create Graph. So the 
issue is how to then attach edge properties.

   Thanks.


zjp_j...@163.com


Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Felix Cheung
That's correct - currently GraphFrame does not compute PageRank with weighted 
edges.


_
From: Weiwei Zhang >
Sent: Thursday, December 1, 2016 2:41 PM
Subject: [GraphFrame, Pyspark] Weighted Edge in PageRank
To: user >


Hi guys,

I am trying to compute the pagerank for the locations in the following dummy 
dataframe,

srcdes  shared_gas_stations
 A   B   2
 A   C  10
 C   E   3
 D   E  12
 E   G   5
...

I have tried the function graphframe.pageRank(resetProbability=0.01, 
maxIter=20) in GraphFrame but it seems like this function doesn't take weighted 
edges. Maybe I am not using it correctly. How can I pass the weighted edges to 
this function? Also I am not sure if this function works for the undirected 
graph.


Thanks a lot!

- Weiwei




Re: PySpark to remote cluster

2016-11-30 Thread Felix Cheung
Spark 2.0.1 is running with a different py4j library than Spark 1.6.

You will probably run into other problems mixing versions though - is there a 
reason you can't run Spark 1.6 on the client?


_
From: Klaus Schaefers 
>
Sent: Wednesday, November 30, 2016 2:44 AM
Subject: PySpark to remote cluster
To: >


Hi,

I want to connect with a local Jupyter Notebook to a remote Spark cluster.
The Cluster is running Spark 2.0.1 and the Jupyter notebook is based on
Spark 1.6 and running in a docker image (Link). I try to init the
SparkContext like this:

import pyspark
sc = pyspark.SparkContext('spark://:7077')

However, this gives me the following exception:


ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 746, in send_command
raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 626, in send_command
response = connection.send_command(command)
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 750, in send_command
raise Py4JNetworkError("Error while sending or receiving", e)
py4j.protocol.Py4JNetworkError: Error while sending or receiving

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 740, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/opt/conda/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 746, in send_command
raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.5/site-packages/IPython/utils/PyColorize.py",
line 262, in format2
for atoken in generate_tokens(text.readline):
File "/opt/conda/lib/python3.5/tokenize.py", line 597, in _tokenize
raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))


Is this error caused by the different spark versions?

Best,

Klaus




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-to-remote-cluster-tp28147.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
>From the stack it looks to be an error from the explicit call to 
>hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also I think the error is happening in this part of the code 
"ConnectedComponents.scala:339" I am referring the code 
@https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

  if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
// TODO: remove this after DataFrame.checkpoint is implemented
val out = s"${checkpointDir.get}/$iteration"
ee.write.parquet(out)
// may hit S3 eventually consistent issue
ee = sqlContext.read.parquet(out)

// remove previous checkpoint
if (iteration > checkpointInterval) {
  FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(s"${checkpointDir.get}/${iteration - 
checkpointInterval}"), true)
    }

System.gc() // hint Spark to clean shuffle directories
  }


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Do you have more of the exception stack?


___

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
This is likely a factor of your hadoop config and Spark rather then anything 
specific with GraphFrames.

You might have better luck getting assistance if you could isolate the code to 
a simple case that manifests the problem (without GraphFrames), and repost.



From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Thursday, January 5, 2017 3:45:59 PM
To: Felix Cheung; d...@spark.apache.org
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Adding DEV mailing list to see if this is a defect with ConnectedComponent or 
if they can recommend any solution.

Thanks
Ankur

On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Yes I did try it out and it choses the local file system as my checkpoint 
location starts with s3n://

I am not sure how can I make it load the S3FileSystem.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org<mailto:user@spark.apache.org>

Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
>From the stack it looks to be an error from the explicit call to 
>hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.d

Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Felix Cheung
Do you have more of the exception stack?



From: Ankur Srivastava 
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponent algorithm of GraphFrames but by 
default it needs a checkpoint directory. As I am running my spark cluster with 
S3 as the DFS and do not have access to HDFS file system I tried using a s3 
directory as checkpoint directory but I run into below exception:


Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set checkpoint interval to -1 to avoid checkpointing the driver just hangs 
after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark or any other 
option?

Thanks
Ankur



Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
>From the stack it looks to be an error from the explicit call to 
>hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>


This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also I think the error is happening in this part of the code 
"ConnectedComponents.scala:339" I am referring the code 
@https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

  if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
// TODO: remove this after DataFrame.checkpoint is implemented
val out = s"${checkpointDir.get}/$iteration"
ee.write.parquet(out)
// may hit S3 eventually consistent issue
ee = sqlContext.read.parquet(out)

// remove previous checkpoint
if (iteration > checkpointInterval) {
  FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(s"${checkpointDir.get}/${iteration - 
checkpointInterval}"), true)
}

System.gc() // hint Spark to clean shuffle directories
  }


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Do you have more of the exception stack?



From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponent algorithm of GraphFrames but by 
default it needs a checkpoint directory. As I am running my spark cluster with 
S3 as the DFS and do not have access to HDFS file system I tried using a s3 
directory as checkpoint directory but I run into below exception:


Exception in thread "main"java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set checkpoint interval to -1 to avoid chec

Re: Issue with SparkR setup on RStudio

2016-12-29 Thread Felix Cheung
Any reason you are setting HADOOP_HOME?

>From the error it seems you are running into issue with Hive config likely 
>with trying to load hive-site.xml. Could you try not setting HADOOP_HOME



From: Md. Rezaul Karim 
Sent: Thursday, December 29, 2016 10:24:57 AM
To: spark users
Subject: Issue with SparkR setup on RStudio

Dear Spark users,
I am trying to setup SparkR on RStudio to perform some basic data manipulations 
and ML modeling.  However, I am a strange error while creating SparkR session 
or DataFrame that says: java.lang.IllegalArgumentException Error while 
instantiating 'org.apache.spark.sql.hive.HiveSessionState.
According to Spark documentation at 
http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I 
don't need to configure Hive path or related variables.
I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]", 
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory = 
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()
Please note that the HADOOP_HOME  contains the 'winutils.exe' file. The details 
of the eror is as follows:

Error in handleErrors(returnStatus, conn) :  
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':

   at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

   at scala.collection.Traversabl



 Any kind of help would be appreciated.


Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Felix Cheung
Have you tried the spark-csv package?

https://spark-packages.org/package/databricks/spark-csv



From: Raymond Xie 
Sent: Friday, December 30, 2016 6:46:11 PM
To: user@spark.apache.org
Subject: How to load a big csv to dataframe in Spark 1.6

Hello,

I see there is usually this way to load a csv to dataframe:


sqlContext = SQLContext(sc)

Employee_rdd = sc.textFile("\..\Employee.csv")
   .map(lambda line: line.split(","))

Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

Employee_df.show()

However in my case my csv has 100+ fields, which means toDF() will be very 
lengthy.

Can anyone tell me a practical method to load the data?

Thank you very much.


Raymond



Re: Spark Graphx with Database

2016-12-30 Thread Felix Cheung
You might want to check out GraphFrames - to load database data (as Spark 
DataFrame) and build graphs with them


https://github.com/graphframes/graphframes

_
From: balaji9058 >
Sent: Monday, December 26, 2016 9:27 PM
Subject: Spark Graphx with Database
To: >


Hi All,

I would like to know about spark graphx execution/processing with
database.Yes, i understand spark graphx is in-memory processing but some
extent we can manage querying but would like to do much more complex query
or processing.Please suggest me the usecase or steps for the same.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Graphx-with-Database-tp28253.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Difference in R and Spark Output

2016-12-30 Thread Felix Cheung
Could you elaborate more on the huge difference you are seeing?



From: Saroj C 
Sent: Friday, December 30, 2016 5:12:04 AM
To: User
Subject: Difference in R and Spark Output

Dear All,
 For the attached input file, there is a huge difference between the Clusters 
in R and Spark(ML). Any idea, what could be the difference ?

Note we wanted to create Five(5) clusters.

Please find the snippets in Spark and R

Spark

//Load the Data file

// Create K means Cluster
KMeans kmeans = new KMeans().setK(5).setMaxIter(500)

.setFeaturesCol("features").setPredictionCol("prediction");


In R

//Load the Data File into df

//Create the K Means Cluster

model <- kmeans(df, 5)



Thanks & Regards
Saroj

=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you


Re: How to load a big csv to dataframe in Spark 1.6

2016-12-31 Thread Felix Cheung
Hmm this would seem unrelated? Does it work on the same box without the 
package? Do you have more of the error stack you can share?


_
From: Raymond Xie <xie3208...@gmail.com<mailto:xie3208...@gmail.com>>
Sent: Saturday, December 31, 2016 8:09 AM
Subject: Re: How to load a big csv to dataframe in Spark 1.6
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>


Hello Felix,

I followed the instruction and ran the command:

> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

and I received the following error message:
java.lang.RuntimeException: java.net.ConnectException: Call From 
xie1/192.168.112.150<http://192.168.112.150> to localhost:9000 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused

any thought?




Sincerely yours,


Raymond

On Fri, Dec 30, 2016 at 10:08 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Have you tried the spark-csv package?

https://spark-packages.org/package/databricks/spark-csv



From: Raymond Xie <xie3208...@gmail.com<mailto:xie3208...@gmail.com>>
Sent: Friday, December 30, 2016 6:46:11 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: How to load a big csv to dataframe in Spark 1.6

Hello,

I see there is usually this way to load a csv to dataframe:


sqlContext = SQLContext(sc)Employee_rdd = sc.textFile("\..\Employee.csv")   
.map(lambda line: line.split(","))Employee_df = 
Employee_rdd.toDF(['Employee_ID','Employee_name'])Employee_df.show()

However in my case my csv has 100+ fields, which means toDF() will be very 
lengthy.

Can anyone tell me a practical method to load the data?

Thank you very much.


Raymond






Re: Issue with SparkR setup on RStudio

2017-01-02 Thread Felix Cheung
Perhaps it is with

spark.sql.warehouse.dir="E:/Exp/"

That you have in the sparkConfig parameter.

Unfortunately the exception stack is fairly far away from the actual error, but 
from the top of my head spark.sql.warehouse.dir and HADOOP_HOME are the two 
different pieces that is not set in the Windows tests.


_
From: Md. Rezaul Karim 
<rezaul.ka...@insight-centre.org<mailto:rezaul.ka...@insight-centre.org>>
Sent: Monday, January 2, 2017 7:58 AM
Subject: Re: Issue with SparkR setup on RStudio
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: spark users <user@spark.apache.org<mailto:user@spark.apache.org>>


Hello Cheung,

Happy New Year!

No, I did not configure Hive on my machine. Even I have tried not setting the 
HADOOP_HOME but getting the same error.



Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web:http://www.reza-analytics.eu/index.html<http://139.59.184.114/index.html>

On 29 December 2016 at 19:16, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Any reason you are setting HADOOP_HOME?

>From the error it seems you are running into issue with Hive config likely 
>with trying to load hive-site.xml. Could you try not setting HADOOP_HOME



From: Md. Rezaul Karim 
<rezaul.ka...@insight-centre.org<mailto:rezaul.ka...@insight-centre.org>>
Sent: Thursday, December 29, 2016 10:24:57 AM
To: spark users
Subject: Issue with SparkR setup on RStudio

Dear Spark users,
I am trying to setup SparkR on RStudio to perform some basic data manipulations 
and MLmodeling.  However, I am a strange error while creating SparkR session or 
DataFrame that says:java.lang.IllegalArgumentException Error while 
instantiating 'org.apache.spark.sql.hive.HiveSessionState.
According to Spark documentation 
athttp://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I 
don’t need to configure Hive path or related variables.
I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]", 
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory = 
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()
Please note that the HADOOP_HOME contains the ‘winutils.exe’ file. The details 
of the eror is as follows:

Error in handleErrors(returnStatus, conn) :  
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':

   at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

   at scala.collection.Traversabl



 Any kind of help would be appreciated.


Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web:http://www.reza-analytics.eu/index.html<http://139.59.184.114/index.html>





Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Or this is a better link:
http://graphframes.github.io/quick-start.html

_
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, December 18, 2016 8:46 PM
Subject: Re: GraphFrame not init vertices when load edges
To: <zjp_j...@163.com<mailto:zjp_j...@163.com>>, user 
<user@spark.apache.org<mailto:user@spark.apache.org>>


Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com<mailto:zjp_j...@163.com> 
<zjp_j...@163.com<mailto:zjp_j...@163.com>>
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I fond GraphFrame when create edges not init vertiecs by default, has any plan 
about it or better ways?   Thanks

val e = sqlContext.createDataFrame(List(  ("a", "b", "friend"),  ("b", "c", 
"follow"),  ("c", "b", "follow"),  ("f", "c", "follow"),  ("e", "f", "follow"), 
 ("e", "d", "friend"),  ("d", "a", "friend"),  ("a", "e", 
"friend"))).toDF("src", "dst", "relationship")


zjp_j...@163.com<mailto:zjp_j...@163.com>




Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com 
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I fond GraphFrame when create edges not init vertiecs by default, has any plan 
about it or better ways?   Thanks

val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")


zjp_j...@163.com


Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
There is not a GraphLoader for GraphFrames but you could load and convert from 
GraphX:


http://graphframes.github.io/user-guide.html#graphx-to-graphframe


From: zjp_j...@163.com <zjp_j...@163.com>
Sent: Sunday, December 18, 2016 9:39:49 PM
To: Felix Cheung; user
Subject: Re: Re: GraphFrame not init vertices when load edges

I'm sorry, i  didn't expressed clearly.  Reference to the following Blod 
Underlined text.

 cite from http://spark.apache.org/docs/latest/graphx-programming-guide.html
" 
GraphLoader.edgeListFile<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]>
 provides a way to load a graph from a list of edges on disk. It parses an 
adjacency list of (source vertex ID, destination vertex ID) pairs of the 
following form, skipping comment lines that begin with #:

# This is a comment
2 1
4 1
1 2


It creates a Graph from the specified edges, automatically creating any 
vertices mentioned by edges."


Creating any vertices when create a Graph from specified edges that I think 
it's a good way, but now  
GraphLoader.edgeListFile<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]>
 load format is not allowed to set edge attribute in edge file, So I want to 
know GraphFrames has any plan about it or better ways.

Thannks





____
zjp_j...@163.com

From: Felix Cheung<mailto:felixcheun...@hotmail.com>
Date: 2016-12-19 12:57
To: zjp_j...@163.com<mailto:zjp_j...@163.com>; 
user<mailto:user@spark.apache.org>
Subject: Re: GraphFrame not init vertices when load edges
Or this is a better link:
http://graphframes.github.io/quick-start.html

_____
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, December 18, 2016 8:46 PM
Subject: Re: GraphFrame not init vertices when load edges
To: <zjp_j...@163.com<mailto:zjp_j...@163.com>>, user 
<user@spark.apache.org<mailto:user@spark.apache.org>>


Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com<mailto:zjp_j...@163.com> 
<zjp_j...@163.com<mailto:zjp_j...@163.com>>
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I found that GraphFrame does not initialize vertices by default when creating edges. Is there any plan 
for this, or a better way?   Thanks

val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")


zjp_j...@163.com<mailto:zjp_j...@163.com>




Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Felix Cheung
Interesting!


From: Robert Yokota 
Sent: Sunday, April 2, 2017 9:40:07 AM
To: user@spark.apache.org
Subject: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

Hi,

In case anyone is interested in analyzing graphs in HBase with Apache Spark 
GraphFrames, this might be helpful:

https://yokota.blog/2017/04/02/graph-analytics-on-hbase-with-hgraphdb-and-spark-graphframes/


Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-22 Thread Felix Cheung
Cross-session in this context means multiple Spark sessions created from the same Spark context. 
Since you are running two shells, you have two different Spark contexts.

Do you have to use a temp view? Could you create a table instead?
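Two sketches of what that could look like (the table name is just an example):

// Option A: within one SparkContext, a second session can see the global temp view
val spark2 = spark.newSession()
spark2.sql("select * from global_temp.gView1").show()

// Option B: across separate spark-shell processes, persist a real table instead
// (this assumes both shells point at the same metastore / warehouse directory)
// Shell 1:
spark.sql("select 1 as col1").write.saveAsTable("gView1_tbl")
// Shell 2:
spark.table("gView1_tbl").show()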

_
From: Hemanth Gudela 
>
Sent: Saturday, April 22, 2017 12:57 AM
Subject: Spark SQL - Global Temporary View is not behaving as expected
To: >


Hi,

According to the documentation, global temporary views are cross-session accessible.

But when I try to query a global temporary view from another spark shell like 
this-->
Instance 1 of spark-shell
--
scala> spark.sql("select 1 as col1").createGlobalTempView("gView1")

Instance 2 of spark-shell (while Instance 1 of spark-shell is still alive)
-
scala> spark.sql("select * from global_temp.gView1").show()
org.apache.spark.sql.AnalysisException: Table or view not found: 
`global_temp`.`gView1`
'Project [*]
+- 'UnresolvedRelation `global_temp`.`gView1`

I am expecting that the global temporary view created in shell 1 should be accessible in shell 2, 
but it isn’t!
Please correct me if I am missing something here.

Thanks (in advance),
Hemanth




Re: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ?

2017-04-21 Thread Felix Cheung
Not currently - how are you planning to use the output from word2vec?


From: Radhwane Chebaane 
Sent: Thursday, April 20, 2017 4:30:14 AM
To: user@spark.apache.org
Subject: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ?

Hi,

I've been experimenting with the Spark Word2vec implementation in the
MLLib package with Scala and it was very nice.
I need to use the same algorithm in R leveraging the power of spark 
distribution with SparkR.
I have been looking on the mailing list and Stackoverflow for any Word2vec 
use-case in SparkR but no luck.

Is there any implementation of Word2vec in SparkR ? Is there any current work 
to support this feature in MLlib with R?

Thanks!
Radhwane Chebaane

--

Radhwane Chebaane
Distributed systems engineer, Mindlytix

Mail: radhw...@mindlytix.com 
Mobile: +33 695 588 906 

Skype: rad.cheb 
LinkedIn 




Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Felix Cheung
Awesome! Congrats!!


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Wednesday, July 12, 2017 12:26:00 PM
To: user@spark.apache.org
Subject: With 2.2.0 PySpark is now available for pip install from PyPI :)

Hi wonderful Python + Spark folks,

I'm excited to announce that with Spark 2.2.0 we finally have PySpark published 
on PyPI (see https://pypi.python.org/pypi/pyspark / 
https://twitter.com/holdenkarau/status/885207416173756417). This has been a 
long time coming (previous releases included pip installable artifacts that for 
a variety of reasons couldn't be published to PyPI). So if you (or your 
friends) want to be able to work with PySpark locally on your laptop you've got 
an easier path getting started (pip install pyspark).

If you are setting up a standalone cluster your cluster will still need the 
"full" Spark packaging, but the pip installed PySpark should be able to work 
with YARN or an existing standalone cluster installation (of the same version).

Happy Sparking y'all!

Holden :)


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: how to create List in pyspark

2017-04-28 Thread Felix Cheung
Why not use the SQL functions explode and split?
They would perform better and be more stable than a UDF.
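A rough sketch of that in Scala (the same split and explode functions exist in pyspark.sql.functions; the file path is the one from the original mail):

import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._

val doc = spark.read.text("/Users/rs/Desktop/nohup.out")

// split turns the string column into an array<string> column, no UDF needed
val withSentences = doc.withColumn("sentences", split($"value", " "))

// explode flattens the array into one row per word, if that is what is needed
withSentences.select(explode($"sentences").as("word")).show()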


From: Yanbo Liang 
Sent: Thursday, April 27, 2017 7:34:54 AM
To: Selvam Raman
Cc: user
Subject: Re: how to create List in pyspark

​You can try with UDF, like the following code snippet:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
df = spark.read.text("./README.md")​
split_func = udf(lambda text: text.split(" "), ArrayType(StringType()))
df.withColumn("split_value", split_func("value")).show()

Thanks
Yanbo

On Tue, Apr 25, 2017 at 12:27 AM, Selvam Raman 
> wrote:

documentDF = spark.createDataFrame([

("Hi I heard about Spark".split(" "), ),

("I wish Java could use case classes".split(" "), ),

("Logistic regression models are neat".split(" "), )

], ["text"])


How can I achieve the same df while I am reading from a source?

doc = spark.read.text("/Users/rs/Desktop/nohup.out")

How can I create an array-type "sentences" column from doc (a DataFrame)?


The below one creates more than one column.

rdd.map(lambda rdd: rdd[0]).map(lambda row:row.split(" "))

--
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"



Re: "java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread Felix Cheung
Can you allocate more memory to the executor?

Also please open issue with gf on its github


From: rok 
Sent: Friday, April 28, 2017 1:42:33 AM
To: user@spark.apache.org
Subject: "java.lang.IllegalStateException: There is no space for new record" in 
GraphFrames

When running the connectedComponents algorithm in GraphFrames on a
sufficiently large dataset, I get the following error I have not encountered
before:

17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID
53644, 172.19.1.206, executor 40): java.lang.IllegalStateException: There is
no space for new record
at
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:225)
at
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:130)
at
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Any thoughts on how to avoid this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-in-GraphFrames-tp28635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How save streaming aggregations on 'Structured Streams' in parquet format ?

2017-06-19 Thread Felix Cheung
And perhaps the error message can be improved here?


From: Tathagata Das 
Sent: Monday, June 19, 2017 8:24:01 PM
To: kaniska Mandal
Cc: Burak Yavuz; user
Subject: Re: How save streaming aggregations on 'Structured Streams' in parquet 
format ?

That is not the right way to use watermark + append output mode. The 
`withWatermark` must come before the aggregation. Something like this.

df.withWatermark("timestamp", "1 hour")
  .groupBy(window("timestamp", "30 seconds"))
  .agg(...)

Read more here - 
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
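Putting that together for the device aggregation discussed below, a sketch might look like this (deviceEvents stands in for the streaming DataFrame, the eventtime/deviceId column names and the paths are modeled on the mails in this thread, and the durations are only examples):

import org.apache.spark.sql.functions.window
import spark.implicits._

// the watermark must be set on the event-time column BEFORE the aggregation
val deviceBasicAgg = deviceEvents
  .withWatermark("eventtime", "30 seconds")
  .groupBy(window($"eventtime", "30 seconds"), $"deviceId")
  .count()

// with the watermark in place, append mode works and parquet can be used as the sink
val query = deviceBasicAgg.writeStream
  .format("parquet")
  .option("path", "/data/summary/devices/")
  .option("checkpointLocation", "/tmp/parquet/checkpoints/")
  .outputMode("append")
  .start()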


On Mon, Jun 19, 2017 at 7:27 PM, kaniska Mandal 
> wrote:
Hi Burak,

Per your suggestion, I have specified
> deviceBasicAgg.withWatermark("eventtime", "30 seconds");
before invoking deviceBasicAgg.writeStream()...

But I am still facing ~

org.apache.spark.sql.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets;

I am Ok with 'complete' output mode.

=

I tried another approach - creating a parquet file from the in-memory dataset - which seems to work.

But I need only the delta, not the cumulative count. Since 'append' mode does not support streaming 
aggregation, I have to use 'complete' output mode.

StreamingQuery streamingQry = deviceBasicAgg.writeStream()

  .format("memory")

  .trigger(ProcessingTime.create("5 seconds"))

  .queryName("deviceBasicAggSummary")

  .outputMode("complete")

  .option("checkpointLocation", "/tmp/parquet/checkpoints/")

  .start();

streamingQry.awaitTermination();

Thread.sleep(5000);

while(true) {

Dataset deviceBasicAggSummaryData = spark.table("deviceBasicAggSummary");

deviceBasicAggSummaryData.toDF().write().parquet("/data/summary/devices/"+new 
Date().getTime()+"/");

}

=

So what's the best practice for 'low latency query on distributed data' using 
Spark SQL and Structured Streaming?


Thanks

Kaniska


On Mon, Jun 19, 2017 at 11:55 AM, Burak Yavuz 
> wrote:
Hi Kaniska,

In order to use append mode with aggregations, you need to set an event time 
watermark (using `withWatermark`). Otherwise, Spark doesn't know when to output 
an aggregation result as "final".

Best,
Burak

On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal 
> wrote:
Hi,

My goal is to ~
(1) either chain streaming aggregations in a single query OR
(2) run multiple streaming aggregations and save data in some meaningful format 
to execute low latency / failsafe OLAP queries

So my first choice is parquet format , but I failed to make it work !

I am using spark-streaming_2.11-2.1.1

I am facing the following error -
org.apache.spark.sql.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets;

- for the following syntax

 StreamingQuery streamingQry = tagBasicAgg.writeStream()

  .format("parquet")

  .trigger(ProcessingTime.create("10 seconds"))

  .queryName("tagAggSummary")

  .outputMode("append")

  .option("checkpointLocation", "/tmp/summary/checkpoints/")

  .option("path", "/data/summary/tags/")

  .start();

But, parquet doesn't support 'complete' outputMode.

So is parquet supported only for batch queries , NOT for streaming queries ?

- note that console outputmode working fine !

Any help will be much appreciated.

Thanks
Kaniska






Re: problem initiating spark context with pyspark

2017-06-10 Thread Felix Cheung
Curtis, assuming you are running a somewhat recent Windows version, you would 
not have access to C:\tmp in your command example:

winutils.exe ls -F C:\tmp\hive

Try changing the path to under your user directory.

Running Spark on Windows should work :)


From: Curtis Burkhalter 
Sent: Wednesday, June 7, 2017 7:46:56 AM
To: Doc Dwarf
Cc: user@spark.apache.org
Subject: Re: problem initiating spark context with pyspark

Thanks Doc, I saw this on another board yesterday, so I tried it by first 
going to the directory where I've stored winutils.exe and then, as an admin, 
running the command that you suggested. I get this exception when checking 
the permissions:

C:\winutils\bin>winutils.exe ls -F C:\tmp\hive
FindFileOwnerAndPermission error (1789): The trust relationship between this 
workstation and the primary domain failed.

I'm fairly new to the command line and to working out what the different 
exceptions mean. Do you have any advice on what this error means and how I might 
go about fixing it?

Thanks again


On Wed, Jun 7, 2017 at 9:51 AM, Doc Dwarf 
> wrote:
Hi Curtis,

I believe in windows, the following command needs to be executed: (will need 
winutils installed)

D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive



On 6 June 2017 at 09:45, Curtis Burkhalter 
> wrote:
Hello all,

I'm new to Spark and I'm trying to interact with it using Pyspark. I'm using 
the prebuilt version of spark v. 2.1.1 and when I go to the command line and 
use the command 'bin\pyspark' I have initialization problems and get the 
following message:

C:\spark\spark-2.1.1-bin-hadoop2.7> bin\pyspark
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC 
v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/06 10:30:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/06/06 10:30:21 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
17/06/06 10:30:21 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
Traceback (most recent call last):
  File "C:\spark\spark-2.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 
63, in deco
return f(*a, **kw)
  File 
"C:\spark\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 13 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveExternalCatalog':
at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
at 

Re: graphframes on cluster

2017-09-20 Thread Felix Cheung
Could you include the code where it fails?
Generally the best way to use GraphFrames is to use the --packages option with the 
spark-submit command.
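Something along these lines, for example (the graphframes version shown is only illustrative and should match your Spark and Scala versions; the class and jar names are placeholders):

spark-submit \
  --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 \
  --class com.example.MyGraphJob \
  my-graph-job.jar

With --packages, the dependency is resolved and shipped to both the driver and the executors, so the GraphFrames classes are on the classpath cluster-wide.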


From: Imran Rajjad 
Sent: Wednesday, September 20, 2017 5:47:27 AM
To: user @spark
Subject: graphframes on cluster

Trying to run GraphFrames on a Spark cluster. Do I need to include the package 
in the Spark context settings, or is only the driver program supposed to have 
the GraphFrames libraries on its classpath? Currently the job is crashing when 
an action is invoked on GraphFrame classes.

regards,
Imran

--
I.R


Re: How to convert Row to JSON in Java?

2017-09-09 Thread Felix Cheung
toJSON on Dataset/DataFrame?
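A tiny sketch of that (the sample data is made up; the Java API exposes the same toJSON method on Dataset):

import org.apache.spark.sql.Dataset
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// each Row becomes one JSON string, e.g. {"id":1,"value":"a"}
val json: Dataset[String] = df.toJSON
json.show(false)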


From: kant kodali 
Sent: Saturday, September 9, 2017 4:15:49 PM
To: user @spark
Subject: How to convert Row to JSON in Java?

Hi All,

How to convert Row to JSON in Java? It would be nice to have .toJson() method 
in the Row class.

Thanks,
kant


Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Felix Cheung
What is newDS?
If it is a streaming Dataset/DataFrame (since you have writeStream there) then 
there seems to be an issue preventing toJSON from working.


From: kant kodali 
Sent: Saturday, September 9, 2017 4:04:33 PM
To: user @spark
Subject: Queries with streaming sources must be executed with 
writeStream.start()

Hi All,

I have the following code and I am not sure what's wrong with it. Why can't I 
call dataset.toJSON() (which returns a Dataset)? I am using Spark 2.2.0, so I 
am wondering if there is any workaround.


 Dataset ds = newDS.toJSON().map(()->{some function that returns a 
string});
 StreamingQuery query = ds.writeStream().start();
 query.awaitTermination();


Re: using R with Spark

2017-09-24 Thread Felix Cheung
Both are free to use; you can use sparklyr from the R shell without RStudio 
(but you probably want an IDE)



From: Adaryl Wakefield 
Sent: Sunday, September 24, 2017 11:19:24 AM
To: user@spark.apache.org
Subject: using R with Spark

There are two packages SparkR and sparklyr. Sparklyr seems to be the more 
useful. However, do you have to pay to use it? Unless I’m not reading this 
right, it seems you have to have the paid version of RStudio to use it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: using R with Spark

2017-09-24 Thread Felix Cheung
If you google it you will find posts or info on how to connect it to different 
cloud and hadoop/spark vendors.



From: Georg Heiler <georg.kf.hei...@gmail.com>
Sent: Sunday, September 24, 2017 1:39:09 PM
To: Felix Cheung; Adaryl Wakefield; user@spark.apache.org
Subject: Re: using R with Spark

No, it is free to use; you might need RStudio Server depending on which Spark 
master you choose.
Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> 
schrieb am So. 24. Sep. 2017 um 22:24:
Both are free to use; you can use sparklyr from the R shell without RStudio 
(but you probably want an IDE)


From: Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com<mailto:adaryl.wakefi...@hotmail.com>>
Sent: Sunday, September 24, 2017 11:19:24 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: using R with Spark

There are two packages SparkR and sparklyr. Sparklyr seems to be the more 
useful. However, do you have to pay to use it? Unless I’m not reading this 
right, it seems you have to have the paid version of RStudio to use it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>




Re: using R with Spark

2017-09-24 Thread Felix Cheung
There are other approaches like this

Find Livy on the page
https://blog.rstudio.com/2017/01/24/sparklyr-0-5/

Probably will be best to follow up with sparklyr for any support question.


From: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Sent: Sunday, September 24, 2017 2:42:19 PM
To: user@spark.apache.org
Subject: RE: using R with Spark

>It is free to use; you might need RStudio Server depending on which Spark master
>you choose.
Yeah I think that’s where my confusion is coming from. I’m looking at a cheat 
sheet. For connecting to a Yarn Cluster the first step is;

  1.  Install RStudio Server or RStudio Pro on one of the existing edge nodes.

As a matter of fact, it looks like any instance where you’re connecting to a 
cluster requires the paid version of RStudio. All the links I google are 
suggesting this. And then there is this:
https://stackoverflow.com/questions/39798798/connect-sparklyr-to-remote-spark-connection

That’s about a year old, but I haven’t found anything that contradicts it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>


From: Georg Heiler [mailto:georg.kf.hei...@gmail.com]
Sent: Sunday, September 24, 2017 3:39 PM
To: Felix Cheung <felixcheun...@hotmail.com>; Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com>; user@spark.apache.org
Subject: Re: using R with Spark

No, it is free to use; you might need RStudio Server depending on which Spark 
master you choose.
Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> 
schrieb am So. 24. Sep. 2017 um 22:24:
Both are free to use; you can use sparklyr from the R shell without RStudio 
(but you probably want an IDE)


From: Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com<mailto:adaryl.wakefi...@hotmail.com>>
Sent: Sunday, September 24, 2017 11:19:24 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: using R with Spark

There are two packages SparkR and sparklyr. Sparklyr seems to be the more 
useful. However, do you have to pay to use it? Unless I’m not reading this 
right, it seems you have to have the paid version of RStudio to use it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>




Re: sparkR 3rd library

2017-09-04 Thread Felix Cheung
Can you include the code you call spark.lapply?



From: patcharee 
Sent: Sunday, September 3, 2017 11:46:40 PM
To: spar >> user@spark.apache.org
Subject: sparkR 3rd library

Hi,

I am using spark.lapply to execute an existing R script in standalone
mode. This script calls a function 'rbga' from a 3rd library 'genalg'.
This rbga function works fine in sparkR env when I call it directly, but
when I apply this to spark.lapply I get the error

could not find function "rbga"
 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala

Any ideas/suggestions?

BR, Patcharee


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Spark R]: dapply only works for very small datasets

2017-11-27 Thread Felix Cheung
What's the number of executors and/or the number of partitions you are working with?

I'm afraid most of the problem is with the serialization deserialization 
overhead between JVM and R...


From: Kunft, Andreas 
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user defined functions with R using the airline arrival 
performance dataset.

While the examples from the documentation for the `<-` apply operator work 
perfectly fine on a size ~9GB,

the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stackoverflow and even asked the question there as well, but till now 
the only answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0 and read the data from HDFS 2.7.1 and played with several 
DOPs.

Best
Andreas



Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Felix Cheung
You can find more discussions in
https://issues.apache.org/jira/browse/SPARK-18924
And
https://issues.apache.org/jira/browse/SPARK-17634

I suspect the cost is linear - so partitioning the data into smaller chunks 
with more executors (one core each) running in parallel would probably help a 
bit.

Unfortunately this is an area where we really could use some improvements, 
and I think it *should* be possible (hmm 
https://databricks.com/blog/2017/10/06/accelerating-r-workflows-on-databricks.html.
 ;)

_
From: Kunft, Andreas <andreas.ku...@tu-berlin.de>
Sent: Tuesday, November 28, 2017 3:11 AM
Subject: AW: [Spark R]: dapply only works for very small datasets
To: Felix Cheung <felixcheun...@hotmail.com>, <user@spark.apache.org>



Thanks for the fast reply.


I tried it locally with 1-8 slots on an 8-core machine with 25GB memory, as well 
as on 4 nodes with the same specifications.

When I shrink the data to around 100MB,

it runs in about 1 hour for 1 core and about 6 min with 8 cores.


I'm aware that the serDe takes time, but it seems there must be something else 
off considering these numbers.


____
Von: Felix Cheung <felixcheun...@hotmail.com>
Gesendet: Montag, 27. November 2017 20:20
An: Kunft, Andreas; user@spark.apache.org
Betreff: Re: [Spark R]: dapply only works for very small datasets

What’s the number of executor and/or number of partitions you are working with?

I’m afraid most of the problem is with the serialization deserialization 
overhead between JVM and R...


From: Kunft, Andreas <andreas.ku...@tu-berlin.de>
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user defined functions with R using the airline arrival 
performance dataset.

While the examples from the documentation for the `<-` apply operator work 
perfectly fine on a size ~9GB,

the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stackoverflow and even asked the question there as well, but till now 
the only answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0 and read the data from HDFS 2.7.1 and played with several 
DOPs.

Best
Andreas





Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the 
Zeppelin project.


From: Valery Khamenya 
Sent: Tuesday, May 1, 2018 4:30:24 AM
To: user@spark.apache.org
Subject: all calculations finished, but "VCores Used" value remains at its max

Hi all

I am experiencing a strange thing: when Spark 2.3.0 calculations started from 
Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager stays 
at its maximum, even though nothing should still be running. How come?

If relevant, I have experienced this issue since AWS EMR 5.13.0.

best regards
--
Valery


best regards
--
Valery A.Khamenya


Re: Passing an array of more than 22 elements in a UDF

2017-12-24 Thread Felix Cheung
Or use it with Scala 2.11?


From: ayan guha 
Sent: Friday, December 22, 2017 3:15:14 AM
To: Aakash Basu
Cc: user
Subject: Re: Passing an array of more than 22 elements in a UDF

Hi, I think you are on the right track. You can put all your parameters into a suitable 
data structure such as an array or a dict and pass that structure as a single parameter to 
your UDF.
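A minimal sketch of that idea in Scala (df and the c1..c30 column names are hypothetical):

import org.apache.spark.sql.functions.{array, col, udf}

// one UDF parameter that carries all the values, instead of 22+ separate arguments
val sumAll = udf((xs: Seq[Int]) => xs.sum)

val cols = (1 to 30).map(i => col(s"c$i"))
val result = df.withColumn("total", sumAll(array(cols: _*)))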

On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu 
> wrote:
Hi,

I am using Spark 2.2 with Java. Can anyone please suggest how to take more 
than 22 parameters in a UDF? I mean, what if I want to pass all the parameters as 
an array of integers?

Thanks,
Aakash.
--
Best Regards,
Ayan Guha


Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
And Hadoop-3.x is not part of the release and sign off for 2.2.1.

Maybe we could update the website to avoid any confusion with "later".


From: Josh Rosen 
Sent: Monday, January 8, 2018 10:17:14 AM
To: akshay naidu
Cc: Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

My current best guess is that Spark does not fully support Hadoop 3.x because 
https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for 
Hadoop 3.x) has not been resolved. There are also likely to be transitive 
dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu 
> wrote:
Yes, the Spark download page does mention that 2.2.1 is for 'hadoop-2.7 and 
later', but my confusion comes from the fact that Spark was released on 1st Dec 
and the Hadoop 3 stable version was released on 13th Dec. In reply to my similar 
question on stackoverflow.com, Mr. Jacek Laskowski said that Spark 2.2.1 doesn't 
support Hadoop 3, so I am just looking for more clarity on this before moving on 
to upgrades.

Thanks all for help.

Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao 
> wrote:
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it is 
not clear whether it is supported or not (or has some issues). I think in the 
download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it 
supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya 
>:
Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an option 
to select package type. In that, there is an option to select  "Pre-Built for 
Apache Hadoop 2.7 and later". I am assuming it means that it does support 
Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
> wrote:
Hello users,
I need to know whether we can run the latest Spark on the latest Hadoop version, i.e., 
Spark 2.2.1 released on 1st Dec and Hadoop 3.0.0 released on 13th Dec.
Thanks.




