Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
The correct link is
https://docs.databricks.com/spark/latest/spark-sql/index.html .

This link does have the core syntax, such as the BNF for the DDL, DML, and
SELECT.  It does *not* have a reference for the date / string / numeric
functions: is there any such reference at this point?  It is not sufficient
to peruse the DSL's list of functions, since the usage (and sometimes the
names) differs from the DSL.

thanks
stephenb

2017-08-10 14:49 GMT-07:00 Jules Damji :

> I refer to docs.databricks.com/Spark/latest/Spark-sql/index.html.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> > On Aug 10, 2017, at 1:46 PM, Stephen Boesch  wrote:
> >
> >
> > While DataFrames/Datasets are useful in many circumstances, they are
> cumbersome for many types of complex SQL queries.
> >
> > Is there an up-to-date *SQL* reference - i.e. not DataFrame DSL
> operations - for version 2.2?
> >
> > An example of what is not clear: what constructs are supported within
> >
> > select count( predicate) from some_table
> >
> > when using Spark SQL.
> >
> > But in general, the reference guide and programming guide for SQL seem
> to be difficult to locate - seemingly in favor of the DataFrames/Datasets.
>
>


Re: How can I tell if a Spark job is successful or not?

2017-08-10 Thread Ryan
You could exit with an error code, just as a normal Java/Scala application
would, and read it from the driver/YARN.
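Ryan's suggestion can be sketched as a small wrapper; this is a hedged illustration, not Spark API (runWithExitCode and the job body are hypothetical names):

```scala
// Sketch: turn job success/failure into a process exit code, which YARN and
// any calling shell can observe. Pure Scala; the job body is passed in.
def runWithExitCode(job: () => Unit): Int =
  try { job(); 0 }
  catch { case e: Exception => e.printStackTrace(); 1 }

// In a real driver, main would end with:
//   sys.exit(runWithExitCode(() => runSparkJob()))  // runSparkJob: your job body
```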

On Fri, Aug 11, 2017 at 9:55 AM, Wei Zhang 
wrote:

> I suppose you can find the job status from the YARN UI application view.
>
>
>
> Cheers,
>
> -z
>
>
>
> *From:* 陈宇航 [mailto:yuhang.c...@foxmail.com]
> *Sent:* Thursday, August 10, 2017 5:23 PM
> *To:* user 
> *Subject:* How can I tell if a Spark job is successful or not?
>
>
>
> I want to do some clean-ups after a Spark job is finished, and the action
> I would do depends on whether the job is successful or not.
>
> So where can I get the result of the job?
>
> I already tried SparkListener; it works fine when the job succeeds,
> but if the job fails, the listener does not seem to be called.
>


RE: How can I tell if a Spark job is successful or not?

2017-08-10 Thread Wei Zhang
I suppose you can find the job status from the YARN UI application view.

Cheers,
-z

From: 陈宇航 [mailto:yuhang.c...@foxmail.com]
Sent: Thursday, August 10, 2017 5:23 PM
To: user 
Subject: How can I tell if a Spark job is successful or not?


I want to do some clean-ups after a Spark job is finished, and the action I 
would do depends on whether the job is successful or not.

So where can I get the result of the job?

I already tried SparkListener; it works fine when the job succeeds, 
but if the job fails, the listener does not seem to be called.


Issues when trying to recover a textFileStream from checkpoint in Spark streaming

2017-08-10 Thread SRK
Hi,

I am facing issues while trying to recover a textFileStream from a checkpoint.
Basically, it is trying to load files from the beginning of the job start,
whereas I am deleting the files after processing them. I have the following
configs set, so I was thinking that it should not look for files older than 2
minutes when trying to recover from the checkpoint. Any suggestions on this
would be of great help.

  sparkConf.set("spark.streaming.minRememberDuration","120s")
  sparkConf.set("spark.streaming.fileStream.minRememberDuration","120s")

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Issues-when-trying-to-recover-a-textFileStream-from-checkpoint-in-Spark-streaming-tp29052.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Jules Damji
I refer to docs.databricks.com/Spark/latest/Spark-sql/index.html. 

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Aug 10, 2017, at 1:46 PM, Stephen Boesch  wrote:
> 
> 
> While DataFrames/Datasets are useful in many circumstances, they are
> cumbersome for many types of complex SQL queries.
> 
> Is there an up-to-date *SQL* reference - i.e. not DataFrame DSL operations -
> for version 2.2?
> 
> An example of what is not clear: what constructs are supported within
> 
> select count( predicate) from some_table
> 
> when using Spark SQL.
> 
> But in general, the reference guide and programming guide for SQL seem to be
> difficult to locate - seemingly in favor of the DataFrames/Datasets.





Re: Does Spark SQL use Calcite?

2017-08-10 Thread Jules Damji
Yes, it's used more in Hive than in Spark.

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Aug 10, 2017, at 2:24 PM, Sathish Kumaran Vairavelu 
>  wrote:
> 
> I think it is for hive dependency.
>> On Thu, Aug 10, 2017 at 4:14 PM kant kodali  wrote:
>> Since I see a calcite dependency in Spark I wonder where Calcite is being 
>> used?
>> 
>>> On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu 
>>>  wrote:
>>> Spark SQL doesn't use Calcite
>>> 
 On Thu, Aug 10, 2017 at 3:14 PM kant kodali  wrote:
 Hi All, 
 
 Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has
 Catalyst, which generates its own logical plans, physical plans, and
 other optimizations.
 
 Thanks,
 Kant
>> 


Re: Does Spark SQL use Calcite?

2017-08-10 Thread Sathish Kumaran Vairavelu
I think it is for hive dependency.
On Thu, Aug 10, 2017 at 4:14 PM kant kodali  wrote:

> Since I see a calcite dependency in Spark I wonder where Calcite is being
> used?
>
> On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu <
> vsathishkuma...@gmail.com> wrote:
>
>> Spark SQL doesn't use Calcite
>>
>> On Thu, Aug 10, 2017 at 3:14 PM kant kodali  wrote:
>>
>>> Hi All,
>>>
>>> Does Spark SQL use Calcite? If so, what for? I thought Spark SQL
>>> has Catalyst, which generates its own logical plans, physical plans, and
>>> other optimizations.
>>>
>>> Thanks,
>>> Kant
>>>
>>
>


Re: Does Spark SQL use Calcite?

2017-08-10 Thread kant kodali
Since I see a Calcite dependency in Spark, I wonder where Calcite is being
used?

On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Spark SQL doesn't use Calcite
>
> On Thu, Aug 10, 2017 at 3:14 PM kant kodali  wrote:
>
>> Hi All,
>>
>> Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has
>> Catalyst, which generates its own logical plans, physical plans, and
>> other optimizations.
>>
>> Thanks,
>> Kant
>>
>


SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
While DataFrames/Datasets are useful in many circumstances, they are
cumbersome for many types of complex SQL queries.

Is there an up-to-date *SQL* reference - i.e. not DataFrame DSL operations
- for version 2.2?

An example of what is not clear: what constructs are supported within

select count( predicate) from some_table

when using Spark SQL.

But in general, the reference guide and programming guide for SQL seem to
be difficult to locate - seemingly in favor of the DataFrames/Datasets.
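For what it is worth, a hedged note on the example: as in standard SQL, count(expr) counts rows where expr evaluates to a non-NULL value, so count(predicate) counts non-NULL boolean results rather than only the true ones. The usual spellings might look like this (a sketch; assumes a SparkSession `spark` and a registered table `some_table` with a numeric column `col`, both placeholders):

```scala
// Sketch: table and column names are placeholders.
// Count rows matching a predicate:
val matchingViaWhere = spark.sql(
  "SELECT count(*) AS n FROM some_table WHERE col > 10")

// Count matching rows while still scanning the whole table
// (useful alongside other aggregates in the same query):
val matchingViaCase = spark.sql(
  "SELECT sum(CASE WHEN col > 10 THEN 1 ELSE 0 END) AS n FROM some_table")
```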


Re: Does Spark SQL use Calcite?

2017-08-10 Thread Sathish Kumaran Vairavelu
Spark SQL doesn't use Calcite
On Thu, Aug 10, 2017 at 3:14 PM kant kodali  wrote:

> Hi All,
>
> Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has
> Catalyst, which generates its own logical plans, physical plans, and
> other optimizations.
>
> Thanks,
> Kant
>


Does Spark SQL use Calcite?

2017-08-10 Thread kant kodali
Hi All,

Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has
Catalyst, which generates its own logical plans, physical plans, and
other optimizations.

Thanks,
Kant


Re: How do I pass multiple cassandra hosts in spark submit?

2017-08-10 Thread shyla deshpande
Got the answer from
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/ETCZdCcaKq8



On Thu, Aug 10, 2017 at 11:59 AM, shyla deshpande 
wrote:

> I have a 3-node Cassandra cluster. I want to pass all 3 nodes in spark
> submit. How do I do that?
> Any code samples will help.
> Thanks
>


How do I pass multiple cassandra hosts in spark submit?

2017-08-10 Thread shyla deshpande
I have a 3-node Cassandra cluster. I want to pass all 3 nodes in spark
submit. How do I do that?
Any code samples will help.
Thanks
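For reference, the DataStax spark-cassandra-connector reads its contact points from the spark.cassandra.connection.host property, which accepts a comma-separated list; a sketch (host addresses are placeholders):

```scala
import org.apache.spark.SparkConf

// Sketch: comma-separated contact points for the DataStax connector
// (the addresses below are placeholders).
val conf = new SparkConf()
  .setAppName("cassandra-app")
  .set("spark.cassandra.connection.host", "10.0.0.1,10.0.0.2,10.0.0.3")

// Or equivalently on the command line:
//   spark-submit --conf spark.cassandra.connection.host=10.0.0.1,10.0.0.2,10.0.0.3 ...
```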


Re: KafkaUtils.createRDD , How do I read all the data from kafka in a batch program for a given topic?

2017-08-10 Thread shyla deshpande
Thanks Cody.

On Wed, Aug 9, 2017 at 8:46 AM, Cody Koeninger  wrote:

> org.apache.spark.streaming.kafka.KafkaCluster has methods
> getLatestLeaderOffsets and getEarliestLeaderOffsets
>
> On Mon, Aug 7, 2017 at 11:37 PM, shyla deshpande
>  wrote:
> > Thanks TD.
> >
> > On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das <
> tathagata.das1...@gmail.com>
> > wrote:
> >>
> >> I don't think there is any easier way.
> >>
> >> On Mon, Aug 7, 2017 at 7:32 PM, shyla deshpande <
> deshpandesh...@gmail.com>
> >> wrote:
> >>>
> >>> Thanks TD for the response. I forgot to mention that I am not using
> >>> structured streaming.
> >>>
> >>> I was looking into KafkaUtils.createRDD, and looks like I need to get
> the
> >>> earliest and the latest offset for each partition to build the
> >>> Array(offsetRange). I wanted to know if there was an easier way.
> >>>
> >>> One reason we are hesitant to use structured streaming is that I
> >>> need to persist the data in a Cassandra database, which I believe is not
> >>> yet production-ready.
> >>>
> >>>
> >>> On Mon, Aug 7, 2017 at 6:11 PM, Tathagata Das
> >>>  wrote:
> 
>  Its best to use DataFrames. You can read from as streaming or as
> batch.
>  More details here.
> 
> 
>  https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-batch-queries
> 
>  https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
> 
>  On Mon, Aug 7, 2017 at 6:03 PM, shyla deshpande
>   wrote:
> >
> > Hi all,
> >
> > What is the easiest way to read all the data from kafka in a batch
> > program for a given topic?
> > I have 10 kafka partitions, but the data is not much. I would like to
> > read  from the earliest from all the partitions for a topic.
> >
> > I appreciate any help. Thanks
> 
> 
> >>>
> >>
> >
>
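Putting Cody's pointers together, a batch read over the whole topic might be sketched like this (0.8-era spark-streaming-kafka API; earliestOffset/latestOffset are hypothetical lookups built on KafkaCluster.getEarliestLeaderOffsets / getLatestLeaderOffsets, and sc, kafkaParams, and numPartitions are assumed to exist in the surrounding driver code):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// Sketch: one OffsetRange per partition, from earliest to latest offset.
// earliestOffset/latestOffset are hypothetical helpers (see lead-in).
val offsetRanges: Array[OffsetRange] =
  (0 until numPartitions).map { p =>
    OffsetRange.create("mytopic", p, earliestOffset(p), latestOffset(p))
  }.toArray

// Read the whole topic as a single batch RDD.
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)
```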


Spark streaming - Processing time keeps on increasing under following scenario

2017-08-10 Thread Ravi Gurram
Hi,

 

I have a spark streaming task that basically does the following,

 

1.  Read a batch using a custom receiver
2.  Parse and apply transforms to the batch
3.  Convert the raw fields to a bunch of features
4.  Use a pre-built model to predict the class of each record in the
batch
5.  Output the result to a DB

 

Everything was working fine and my streaming pipeline was pretty stable.

 

I realized that the results were wrong, as some of the features in the
records required cumulative data,

for example "Ratio of Dest IP" to "Total number of IPs" for a given "Source
IP". Now these features are not correct when I am dealing with a batch, because
the batch only has a micro view of the entire dataset. So I changed the code
and inserted another step:

 

3a). This step accumulates the data for a given "Source IP" over
multiple batches. So far so good.

 

To achieve this I used a DataFrame which is a "var" instead of a "val". As
new batches come in, I extract the "SIP"-based data and union it with the
existing DataFrame, and also do a bit of filtering, as I do not want my data
to keep growing in size over time (I keep only, say, 30 minutes' worth of
data).

 

Now when I test the system, I see that the "processing time" for each batch
keeps continuously increasing. I understand it going up until the 30-minute
mark, but at that point, as data gets filtered based on time, the size of the
SIP RDD (DF) is almost constant, yet the processing time keeps increasing. This
leads to my streaming pipeline eventually becoming unstable, and the app
dies of OOM (the receiver executor gets bloated and dies).

 

I have tested this for almost a week now, and this line,

srcHostsDF.filter(srcHostsDF("last_updated_time") > ejectTime)
  .union(batInterDF)
  .persist()

where "batInterDF" is the received batch and "srcHostsDF" is the DataFrame
in which I keep data across batches,

shows up in the Spark UI as increasing over time. The size of "srcHostsDF"
is fairly constant; if so, why should the time taken by persist go up?

 

The other two calls that show up as increasing in time are srcHostsDF.count()
and srcHostsDF.rdd. Why should this be the case?

 

Any clues on what is happening?

 

I replaced the "persist" with a "repartition" and I still get similar
results. The image below shows the "executor memory" growth: the app starts a
little after 18:20; the neon green line is the "receiver executor", the others
are the "process executors". There are 5 executors and 1 driver in all. One
tick before 19:00 is where the size of 'srcHostsDF' stabilises.

[Executor-memory chart not preserved in this archive.]
 

Regards

-Ravi Gurram

 

 

 



How can I tell if a Spark job is successful or not?

2017-08-10 Thread 陈宇航
I want to do some clean-ups after a Spark job is finished, and the action I 
would do depends on whether the job is successful or not.

So where can I get the result of the job?

I already tried SparkListener; it works fine when the job succeeds, 
but if the job fails, the listener does not seem to be called.
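One hedged sketch of the listener route: onJobEnd should fire for failed jobs as well, with the outcome carried in jobResult (whether this covers your failure mode may depend on how the job dies; the class name and match arms below are illustrative):

```scala
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

// Sketch: distinguish success from failure in onJobEnd via jobResult.
class CleanupListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    jobEnd.jobResult match {
      case JobSucceeded => println(s"job ${jobEnd.jobId} succeeded: run clean-up")
      case _            => println(s"job ${jobEnd.jobId} failed: run failure clean-up")
    }
}

// Register it on the SparkContext before running jobs:
//   sc.addSparkListener(new CleanupListener)
```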

Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Hemanth Gudela
Yeah, installing HDFS in our environment is unfortunately going to take a lot of 
time (approvals/planning etc.). I will have to live with the local FS for now.
The other option I had already tried is collect() and sending everything to the 
driver node. But my data volume is too huge for the driver node to handle alone.

I’m now trying to split the data into multiple datasets, then collect each 
individual dataset and write it to the local FS on the driver node (this approach 
slows down the Spark job, but I hope it works).
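That collect-and-write step might look like the following sketch (writeLocalCsv is a hypothetical helper; plain JVM I/O with no CSV quoting/escaping, and it assumes each collected chunk fits in driver memory):

```scala
import java.io.{File, PrintWriter}

// Hypothetical helper: write rows already collected to the driver as one CSV
// file on the driver's local file system (no quoting/escaping handled).
def writeLocalCsv(rows: Seq[Seq[String]], path: String): Unit = {
  val pw = new PrintWriter(new File(path))
  try rows.foreach(r => pw.println(r.mkString(",")))
  finally pw.close()
}

// Usage against a DataFrame (assumes `myDataFrame` exists):
//   writeLocalCsv(myDataFrame.collect().toSeq.map(_.toSeq.map(String.valueOf)),
//                 "/tmp/out.csv")
```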

Thank you,
Hemanth

From: Femi Anthony 
Date: Thursday, 10 August 2017 at 11.24
To: Hemanth Gudela 
Cc: "user@spark.apache.org" 
Subject: Re: spark.write.csv is not able write files to specified path, but is 
writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

Also, why are you trying to write results locally if you're not using a 
distributed file system ? Spark is geared towards writing to a distributed file 
system. I would suggest trying to collect() so the data is sent to the master 
and then do a write if the result set isn't too big, or repartition before 
trying to write (though I suspect this won't really help). You really should 
install HDFS if that is possible.

Sent from my iPhone

On Aug 10, 2017, at 3:58 AM, Hemanth Gudela 
> wrote:
Thanks for reply Femi!

I’m writing the file like this --> 
myDataFrame.write.mode("overwrite").csv("myFilePath")
There absolutely are no errors/warnings after the write.

_SUCCESS file is created on the master node, but the problem of _temporary is 
noticed only on worker nodes.

I know spark.write.csv works best with HDFS, but with the current setup I have 
in my environment, I have to deal with spark write to node’s local file system 
and not to HDFS.

Regards,
Hemanth

From: Femi Anthony >
Date: Thursday, 10 August 2017 at 10.38
To: Hemanth Gudela 
>
Cc: "user@spark.apache.org" 
>
Subject: Re: spark.write.csv is not able write files to specified path, but is 
writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

Normally the _temporary directory gets deleted as part of the cleanup when the 
write is complete and a SUCCESS file is created. I suspect that the writes are 
not properly completed. How are you specifying the write ? Any error messages 
in the logs ?

On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela 
> wrote:
Hi,

I’m running spark on cluster mode containing 4 nodes, and trying to write CSV 
files to node’s local path (not HDFS).
I’m using spark.write.csv to write CSV files.

On master node:
spark.write.csv creates a folder with csv file name and writes many files with 
part-r-000n suffix. This is okay for me, I can merge them later.
But on worker nodes:
spark.write.csv creates a folder with csv file name and writes 
many folders and files under _temporary/0/. This is not okay for me.
Could someone please suggest me what could have been going wrong in my 
settings/how to be able to write csv files to the specified folder, and not to 
subfolders (_temporary/0/task_xxx) in worker machines.

Thank you,
Hemanth




--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." 
- Albert Einstein.


Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Femi Anthony
Also, why are you trying to write results locally if you're not using a 
distributed file system ? Spark is geared towards writing to a distributed file 
system. I would suggest trying to collect() so the data is sent to the master 
and then do a write if the result set isn't too big, or repartition before 
trying to write (though I suspect this won't really help). You really should 
install HDFS if that is possible.

Sent from my iPhone

> On Aug 10, 2017, at 3:58 AM, Hemanth Gudela  
> wrote:
> 
> Thanks for reply Femi!
>  
> I’m writing the file like this --> 
> myDataFrame.write.mode("overwrite").csv("myFilePath")
> There absolutely are no errors/warnings after the write.
>  
> _SUCCESS file is created on the master node, but the problem of _temporary is 
> noticed only on worker nodes.
>  
> I know spark.write.csv works best with HDFS, but with the current setup I 
> have in my environment, I have to deal with spark write to node’s local file 
> system and not to HDFS.
>  
> Regards,
> Hemanth
>  
> From: Femi Anthony 
> Date: Thursday, 10 August 2017 at 10.38
> To: Hemanth Gudela 
> Cc: "user@spark.apache.org" 
> Subject: Re: spark.write.csv is not able write files to specified path, but 
> is writing to unintended subfolder _temporary/0/task_xxx folder on worker 
> nodes
>  
> Normally the _temporary directory gets deleted as part of the cleanup when 
> the write is complete and a SUCCESS file is created. I suspect that the 
> writes are not properly completed. How are you specifying the write ? Any 
> error messages in the logs ?
>  
> On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela  
> wrote:
> Hi,
>  
> I’m running spark on cluster mode containing 4 nodes, and trying to write CSV 
> files to node’s local path (not HDFS).
> I’m using spark.write.csv to write CSV files.
>  
> On master node:
> spark.write.csv creates a folder with csv file name and writes many files 
> with part-r-000n suffix. This is okay for me, I can merge them later.
> But on worker nodes:
> spark.write.csv creates a folder with csv file name and 
> writes many folders and files under _temporary/0/. This is not okay for me.
> Could someone please suggest me what could have been going wrong in my 
> settings/how to be able to write csv files to the specified folder, and not 
> to subfolders (_temporary/0/task_xxx) in worker machines.
>  
> Thank you,
> Hemanth
>  
> 
> 
>  
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre 
> minds." - Albert Einstein.


Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Hemanth Gudela
Yes, I have tried file:/// with the full path, as well as just the full path 
without the file:/// prefix.
Spark session has been closed, no luck though ☹

Regards,
Hemanth

From: Femi Anthony 
Date: Thursday, 10 August 2017 at 11.06
To: Hemanth Gudela 
Cc: "user@spark.apache.org" 
Subject: Re: spark.write.csv is not able write files to specified path, but is 
writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

Is your filePath prefaced with file:/// and the full path or is it relative ?

You might also try calling close() on the Spark context or session at the end of 
the program execution, to try to ensure that cleanup is completed.

Sent from my iPhone

On Aug 10, 2017, at 3:58 AM, Hemanth Gudela 
> wrote:
Thanks for reply Femi!

I’m writing the file like this --> 
myDataFrame.write.mode("overwrite").csv("myFilePath")
There absolutely are no errors/warnings after the write.

_SUCCESS file is created on the master node, but the problem of _temporary is 
noticed only on worker nodes.

I know spark.write.csv works best with HDFS, but with the current setup I have 
in my environment, I have to deal with spark write to node’s local file system 
and not to HDFS.

Regards,
Hemanth

From: Femi Anthony >
Date: Thursday, 10 August 2017 at 10.38
To: Hemanth Gudela 
>
Cc: "user@spark.apache.org" 
>
Subject: Re: spark.write.csv is not able write files to specified path, but is 
writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

Normally the _temporary directory gets deleted as part of the cleanup when the 
write is complete and a SUCCESS file is created. I suspect that the writes are 
not properly completed. How are you specifying the write ? Any error messages 
in the logs ?

On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela 
> wrote:
Hi,

I’m running spark on cluster mode containing 4 nodes, and trying to write CSV 
files to node’s local path (not HDFS).
I’m using spark.write.csv to write CSV files.

On master node:
spark.write.csv creates a folder with csv file name and writes many files with 
part-r-000n suffix. This is okay for me, I can merge them later.
But on worker nodes:
spark.write.csv creates a folder with csv file name and writes 
many folders and files under _temporary/0/. This is not okay for me.
Could someone please suggest me what could have been going wrong in my 
settings/how to be able to write csv files to the specified folder, and not to 
subfolders (_temporary/0/task_xxx) in worker machines.

Thank you,
Hemanth




--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." 
- Albert Einstein.


Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Femi Anthony
Is your filePath prefaced with file:/// and the full path or is it relative ?

You might also try calling close() on the Spark context or session at the end of 
the program execution, to try to ensure that cleanup is completed.

Sent from my iPhone

> On Aug 10, 2017, at 3:58 AM, Hemanth Gudela  
> wrote:
> 
> Thanks for reply Femi!
>  
> I’m writing the file like this --> 
> myDataFrame.write.mode("overwrite").csv("myFilePath")
> There absolutely are no errors/warnings after the write.
>  
> _SUCCESS file is created on the master node, but the problem of _temporary is 
> noticed only on worker nodes.
>  
> I know spark.write.csv works best with HDFS, but with the current setup I 
> have in my environment, I have to deal with spark write to node’s local file 
> system and not to HDFS.
>  
> Regards,
> Hemanth
>  
> From: Femi Anthony 
> Date: Thursday, 10 August 2017 at 10.38
> To: Hemanth Gudela 
> Cc: "user@spark.apache.org" 
> Subject: Re: spark.write.csv is not able write files to specified path, but 
> is writing to unintended subfolder _temporary/0/task_xxx folder on worker 
> nodes
>  
> Normally the _temporary directory gets deleted as part of the cleanup when 
> the write is complete and a SUCCESS file is created. I suspect that the 
> writes are not properly completed. How are you specifying the write ? Any 
> error messages in the logs ?
>  
> On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela  
> wrote:
> Hi,
>  
> I’m running spark on cluster mode containing 4 nodes, and trying to write CSV 
> files to node’s local path (not HDFS).
> I’m using spark.write.csv to write CSV files.
>  
> On master node:
> spark.write.csv creates a folder with csv file name and writes many files 
> with part-r-000n suffix. This is okay for me, I can merge them later.
> But on worker nodes:
> spark.write.csv creates a folder with csv file name and 
> writes many folders and files under _temporary/0/. This is not okay for me.
> Could someone please suggest me what could have been going wrong in my 
> settings/how to be able to write csv files to the specified folder, and not 
> to subfolders (_temporary/0/task_xxx) in worker machines.
>  
> Thank you,
> Hemanth
>  
> 
> 
>  
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre 
> minds." - Albert Einstein.


Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Hemanth Gudela
Thanks for reply Femi!

I’m writing the file like this --> 
myDataFrame.write.mode("overwrite").csv("myFilePath")
There absolutely are no errors/warnings after the write.

_SUCCESS file is created on the master node, but the problem of _temporary is 
noticed only on worker nodes.

I know spark.write.csv works best with HDFS, but with the current setup I have 
in my environment, I have to deal with spark write to node’s local file system 
and not to HDFS.

Regards,
Hemanth

From: Femi Anthony 
Date: Thursday, 10 August 2017 at 10.38
To: Hemanth Gudela 
Cc: "user@spark.apache.org" 
Subject: Re: spark.write.csv is not able write files to specified path, but is 
writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

Normally the _temporary directory gets deleted as part of the cleanup when the 
write is complete and a SUCCESS file is created. I suspect that the writes are 
not properly completed. How are you specifying the write ? Any error messages 
in the logs ?

On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela 
> wrote:
Hi,

I’m running spark on cluster mode containing 4 nodes, and trying to write CSV 
files to node’s local path (not HDFS).
I’m using spark.write.csv to write CSV files.

On master node:
spark.write.csv creates a folder with csv file name and writes many files with 
part-r-000n suffix. This is okay for me, I can merge them later.
But on worker nodes:
spark.write.csv creates a folder with csv file name and writes 
many folders and files under _temporary/0/. This is not okay for me.
Could someone please suggest me what could have been going wrong in my 
settings/how to be able to write csv files to the specified folder, and not to 
subfolders (_temporary/0/task_xxx) in worker machines.

Thank you,
Hemanth




--
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre minds." 
- Albert Einstein.


Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Femi Anthony
Normally the* _temporary* directory gets deleted as part of the cleanup
when the write is complete and a SUCCESS file is created. I suspect that
the writes are not properly completed. How are you specifying the write ?
Any error messages in the logs ?

On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela 
wrote:

> Hi,
>
>
>
> I’m running spark on cluster mode containing 4 nodes, and trying to write
> CSV files to node’s local path (*not HDFS*).
>
> I’m using spark.write.csv to write CSV files.
>
>
>
> *On master node*:
>
> spark.write.csv creates a folder with csv file name and writes many files
> with part-r-000n suffix. This is okay for me, I can merge them later.
>
> *But on worker nodes*:
>
> spark.write.csv creates a folder with csv file name and
> writes many folders and files under _temporary/0/. This is not okay for me.
>
> Could someone please suggest me what could have been going wrong in my
> settings/how to be able to write csv files to the specified folder, and not
> to subfolders (_temporary/0/task_xxx) in worker machines.
>
>
>
> Thank you,
>
> Hemanth
>
>
>



-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes

2017-08-10 Thread Hemanth Gudela
Hi,

I’m running spark on cluster mode containing 4 nodes, and trying to write CSV 
files to node’s local path (not HDFS).
I’m using spark.write.csv to write CSV files.

On master node:
spark.write.csv creates a folder with csv file name and writes many files with 
part-r-000n suffix. This is okay for me, I can merge them later.
But on worker nodes:
spark.write.csv creates a folder with csv file name and writes 
many folders and files under _temporary/0/. This is not okay for me.
Could someone please suggest what could be going wrong in my settings, and how 
to be able to write CSV files to the specified folder, and not to subfolders 
(_temporary/0/task_xxx), on the worker machines.

Thank you,
Hemanth



Re: Spark SVD benchmark for dense matrices

2017-08-10 Thread Anastasios Zouzias
Hi Jose,

Just to note that in the Databricks blog they state that they compute the
top-5 singular vectors, not all singular values/vectors. Computing all is
much more computationally intensive.

Cheers,
Anastasios
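For comparison with the blog's setup, requesting only the leading singular triplets looks like this (a sketch; assumes an existing RowMatrix `mat` built from an RDD[Vector]):

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Sketch: ask only for the top-5 singular values/vectors, as in the
// Databricks benchmark (assumes `mat: RowMatrix` already exists).
val svd = mat.computeSVD(5, computeU = true)
val s = svd.s  // top-5 singular values
val u = svd.U  // RowMatrix of left singular vectors
val v = svd.V  // local Matrix of right singular vectors
```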





On 09.08.2017 at 15:19, "Jose Francisco Saray Villamizar" <
jsa...@gmail.com> wrote:

Hi everyone,

I am trying to invert a 5000 x 5000 dense matrix (99% non-zeros) by using
SVD, with an approach similar to:

https://stackoverflow.com/questions/29969521/how-to-compute-the-inverse-of-a-rowmatrix-in-apache-spark

The time I'm getting with SVD is close to 10 minutes, which is very long for
me.

A benchmark for SVD is already given here:

https://databricks.com/blog/2014/07/21/distributing-the-singular-value-decomposition-with-spark.html

However, it seems they are using sparse matrices; that's why they get short
times.
Has anyone of you tried to perform an SVD on a very dense, big matrix?

Is this time normal?

Thank you.

-- 
Buen dia, alegria !!
José Francisco Saray Villamizar
cel +33 6 13710693
Lyon, France