Re: How to access line fileName in loading file using the textFile method

2018-09-24 Thread Maxim Gekk
> So my question is: supposing all the files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record comes
> from?

Maybe the input_file_name() function can help you:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$@input_file_name():org.apache.spark.sql.Column
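
A minimal sketch of how it could be used (assuming a SparkSession named spark
and whitespace-delimited tokens; "path/*" is the same placeholder as in your
mail):

import org.apache.spark.sql.functions.{col, explode, input_file_name, split}

// spark.read.text gives one row per line in a column named "value".
val lines = spark.read.text("path/*")

// Tokenize each line and attach the full path of the file it came from.
val wordWithFile = lines.select(
  explode(split(col("value"), "\\s+")).as("word"),
  input_file_name().as("fileName")
)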

On Mon, Sep 24, 2018 at 2:54 PM Soheil Pourbafrani wrote:

> Hi, my text data are in the form of text files. In the processing logic, I
> need to know which file each word comes from. Actually, I need to tokenize
> the words and create (word, fileName) pairs. The naive solution is to call
> sc.textFile for each file and, having the fileName in a variable, create the
> pairs, but it's not efficient and I got a StackOverflowError as the dataset
> grew.
>
> So my question is: supposing all the files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record comes
> from?
>
> Is it possible (and needed) to customize the textFile method?
>


-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

maxim.g...@databricks.com

databricks.com


Re: How to access line fileName in loading file using the textFile method

2018-09-24 Thread Jörn Franke
You can create your own data source that does exactly this.

Why is the file name important if the file content is the same?

> On 24. Sep 2018, at 13:53, Soheil Pourbafrani  wrote:
> 
> Hi, my text data are in the form of text files. In the processing logic, I
> need to know which file each word comes from. Actually, I need to tokenize
> the words and create (word, fileName) pairs. The naive solution is to call
> sc.textFile for each file and, having the fileName in a variable, create the
> pairs, but it's not efficient and I got a StackOverflowError as the dataset
> grew.
>
> So my question is: supposing all the files are in a directory and I read them
> using sc.textFile("path/*"), how can I tell which file each record comes from?
>
> Is it possible (and needed) to customize the textFile method?

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



How to access line fileName in loading file using the textFile method

2018-09-24 Thread Soheil Pourbafrani
Hi, my text data are in the form of text files. In the processing logic, I
need to know which file each word comes from. Actually, I need to tokenize the
words and create (word, fileName) pairs. The naive solution is to call
sc.textFile for each file and, having the fileName in a variable, create the
pairs, but it's not efficient and I got a StackOverflowError as the dataset
grew.

So my question is: supposing all the files are in a directory and I read them
using sc.textFile("path/*"), how can I tell which file each record comes from?

Is it possible (and needed) to customize the textFile method?
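
One RDD-level alternative that is sometimes used for exactly this case (only a
sketch, not from this thread, and it assumes each file is small enough to be
read as a single record):

// sc.wholeTextFiles returns (filePath, fileContent) pairs, so the file name
// stays attached to its content and can be paired with every token.
val pairs = sc.wholeTextFiles("path/*").flatMap { case (fileName, content) =>
  content.split("\\s+").map(word => (word, fileName))
}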


Apache Spark and Airflow connection

2018-09-24 Thread Uğur Sopaoğlu
I have a Docker-based cluster in which I try to schedule Spark jobs using
Airflow. Airflow and Spark are running separately in *different containers*.
However, I cannot run a Spark job from Airflow.

Below is my Airflow script:

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 31)}

dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")

operator = SparkSubmitOperator(task_id='spark_submit_job',
                               conn_id='spark_default',
                               java_class='Main',
                               application='/SimpleSpark.jar',
                               name='airflow-spark-example',
                               dag=dag)

I also configured spark_default in the Airflow UI:

[image: Screenshot from 2018-09-24 12-00-46.png]


However, it produces the following error:

[Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

I think Airflow tries to run the Spark job on its own. How can I configure it
so that it runs the Spark code on the Spark master?

-- 
Uğur Sopaoğlu


Yarn log aggregation of spark streaming job

2018-09-24 Thread ayushChauhan
By default, YARN aggregates logs after an application completes. But I am
trying to aggregate logs for a Spark Streaming job, which in theory will run
forever. I have set the following properties for log aggregation and restarted
YARN by restarting hadoop-yarn-nodemanager on the core & task nodes and
hadoop-yarn-resourcemanager on the master node of my EMR cluster. I can view
my changes at http://node-ip:8088/conf.

yarn.log-aggregation-enable => true
yarn.log-aggregation.retain-seconds => 172800
yarn.log-aggregation.retain-check-interval-seconds => -1
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds => 3600

All the articles and resources only mention including the
yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds property,
after which YARN should start aggregating logs for running jobs. But it is not
working in my case.





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Reply: Re: Question about Spark cluster memory usage monitoring

2018-09-24 Thread kolokasis


Hello,

A good tool for monitoring cluster usage on each node is Grafana, which uses
Prometheus in the background to collect metrics.

-Iacovos


Sent from my Samsung device

 Original message 
From: "Jan Brabec (janbrabe)"
Date: 24/9/18 08:57 (GMT+01:00)
To: Muhib Khan, "Liu, Jialin"
Cc: user@spark.apache.org
Subject: Re: Question about Spark cluster memory usage monitoring



Hello,

Maybe Ganglia http://ganglia.info/ might be useful for you? I have only shallow
experience with it, but it might be what you are looking for.

Best,
Jan

From: Muhib Khan
Date: Thursday 20 September 2018 at 23:39
To: "Liu, Jialin"
Cc: "user@spark.apache.org"
Subject: Re: Question about Spark cluster memory usage monitoring

Hello,

As far as I know, there is no API provided for tracking the execution memory of
a Spark worker node. For tracking the execution memory you will probably need
to access the MemoryManager's onHeapExecutionMemoryPool and
offHeapExecutionMemoryPool objects, which track the memory allocated to tasks
for execution, and either write the amount to a log for later parsing or send
it to the driver at a periodic interval to collect the memory usage in a
central location. Either way, I think tracking execution memory may require
code changes in Apache Spark. Hope this helps.

Regards,
Muhib
Ph.D. Student, FSU

On Thu, Sep 20, 2018 at 4:46 PM Liu, Jialin wrote:

Hi there,

I am currently using a Spark cluster to run jobs, but I really need to collect
the history of actual memory usage (that is, execution memory + storage memory)
of the job across the whole cluster. I know we can get the storage memory usage
through either the Spark UI Executors page or the
SparkContext.getExecutorMemoryStatus API, but I could not get the real-time
execution memory usage.

Is there any way I can collect total memory usage? Thank you so much!

Best,
Jialin Liu

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org





[Spark UI] find driver for an application

2018-09-24 Thread bsikander
Hello,
I am having some trouble using the Spark Master UI to figure out some basic
information; the process is too tedious.
I am using Spark 2.2.1 with Spark standalone.

- In cluster mode, how can I figure out which driver is related to which
application?
- In supervise mode, how can I track the restarts: how many times it was
restarted, the app IDs of all the applications after restart, and the VM IP
where it was running?

Any help would be much appreciated.




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to read multiple libsvm files in Spark?

2018-09-24 Thread Maxim Gekk
Hi,

> Any other alternatives?

Manually form the input path by combining multiple paths via dots. See
https://issues.apache.org/jira/browse/SPARK-12086
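
A possible workaround sketch (my assumption, not something confirmed in this
thread): list the .svm files yourself and union the per-file DataFrames. The
numFeatures value below is only a placeholder; without a common value the
schema may be inferred differently per file and the union can fail.

import org.apache.hadoop.fs.{FileSystem, Path}

// List the input files (directory name taken from the original mail).
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val svmPaths = fs.listStatus(new Path("url_svmlight.tar/url_svmlight"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".svm"))

// Read each file separately and union the results into a single DataFrame.
val all = svmPaths
  .map(p => spark.read.format("libsvm").option("numFeatures", "100").load(p))
  .reduce(_ union _)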

On Thu, Sep 20, 2018 at 12:47 PM Md. Rezaul Karim <rezaul.ka...@insight-centre.org> wrote:

> While trying to read multiple libsvm files using Spark 2.3.0, I'm getting an
> "Exception in thread "main" java.io.IOException: Multiple input paths are not
> supported for libsvm data" exception:
>
> val URLs = spark.read.format("libsvm").load("url_svmlight.tar/url_svmlight/*.svm")
>
> Any other alternatives?
>


Re: Question about Spark cluster memory usage monitoring

2018-09-24 Thread Jan Brabec (janbrabe)
Hello,

Maybe Ganglia http://ganglia.info/ might be useful for you? I have only shallow 
experience with it, but it might be what you are looking for.

Best,
Jan

From: Muhib Khan 
Date: Thursday 20 September 2018 at 23:39
To: "Liu, Jialin" 
Cc: "user@spark.apache.org" 
Subject: Re: Question about Spark cluster memory usage monitoring

Hello,

As far as I know, there is no API provided for tracking the execution memory of
a Spark worker node. For tracking the execution memory you will probably need
to access the MemoryManager's onHeapExecutionMemoryPool and
offHeapExecutionMemoryPool objects, which track the memory allocated to tasks
for execution, and either write the amount to a log for later parsing or send
it to the driver at a periodic interval to collect the memory usage in a
central location. Either way, I think tracking execution memory may require
code changes in Apache Spark. Hope this helps.
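
If modifying Spark itself is not an option, a rough task-level signal can also
be pulled from the public listener API via the peakExecutionMemory task metric.
This is only a sketch of that idea (it is not the MemoryManager-pool approach
described above, and it reports per-task peaks rather than cluster-wide usage):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs the peak execution memory reported by each finished task.
class ExecMemoryListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"peakExecutionMemory=${metrics.peakExecutionMemory} bytes")
    }
  }
}

// Register it on the driver, e.g. sc.addSparkListener(new ExecMemoryListener()),
// or through the spark.extraListeners configuration.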

Regards,
Muhib
Ph.D. Student, FSU



On Thu, Sep 20, 2018 at 4:46 PM Liu, Jialin <jial...@illinois.edu> wrote:
Hi there,

I am currently using a Spark cluster to run jobs, but I really need to collect
the history of actual memory usage (that is, execution memory + storage memory)
of the job across the whole cluster. I know we can get the storage memory usage
through either the Spark UI Executors page or the
SparkContext.getExecutorMemoryStatus API, but I could not get the real-time
execution memory usage.

Is there any way I can collect total memory usage? Thank you so much!

Best,
Jialin Liu
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org