Re: Is there a way to merge parquet small files?

2016-05-19 Thread Gavin Yue
For log files, I would suggest saving them as gzipped text files first. After
aggregation, convert them into Parquet by merging a few files.
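
A rough sketch of that conversion step in Scala (Spark 1.x APIs; the paths, the tab delimiter and the log schema below are placeholders, not from this thread):

import org.apache.spark.sql.SQLContext

// Hypothetical schema for a parsed log line.
case class LogLine(ts: String, level: String, message: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// sc.textFile reads .gz files transparently; path and delimiter are placeholders.
val logs = sc.textFile("hdfs:///logs/2016-05-19/*.gz")
  .map(_.split("\t", 3))
  .collect { case Array(ts, level, msg) => LogLine(ts, level, msg) }

// coalesce merges many small partitions so only a few Parquet files are written.
logs.toDF().coalesce(8).write.parquet("hdfs:///logs/parquet/2016-05-19")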



> On May 19, 2016, at 22:32, Deng Ching-Mallete  wrote:
> 
> IMO, it might be better to merge or compact the parquet files instead of 
> keeping lots of small files in the HDFS. Please refer to [1] for more info. 
> 
> We also encountered the same issue with the slow query, and it was indeed 
> caused by the many small parquet files. In our case, we were processing large 
> data sets with batch jobs instead of a streaming job. To solve our issue, we 
> just did a coalesce to reduce the number of partitions before saving as 
> parquet format. 
> 
> HTH,
> Deng
> 
> [1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
> 
>> On Fri, May 20, 2016 at 1:50 PM, 王晓龙/0515  
>> wrote:
>> I'm using a Spark Streaming program to store log messages into Parquet files
>> every 10 mins.
>> Now, when I query the Parquet data, it usually takes hundreds of thousands of
>> stages to compute a single count.
>> I looked into the Parquet files' path and found a great number of small files.
>> 
>> Did the small files cause the problem? Can I merge them, or is there a
>> better way to solve it?
>> 
>> Lots of thanks.
>> 
>> 
>> This email represents only the personal views and opinions of the sender and not those of China Merchants Bank Co., Ltd. or its branches; China Merchants Bank Co., Ltd. and its branches accept no responsibility for its content. It is intended solely for the recipient; if you have received it in error, please delete it immediately.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 


Re: Is there a way to merge parquet small files?

2016-05-19 Thread Deng Ching-Mallete
IMO, it might be better to merge or compact the parquet files instead of
keeping lots of small files in the HDFS. Please refer to [1] for more info.

We also encountered the same issue with the slow query, and it was indeed
caused by the many small parquet files. In our case, we were processing
large data sets with batch jobs instead of a streaming job. To solve our
issue, we just did a coalesce to reduce the number of partitions before
saving as parquet format.

HTH,
Deng

[1] http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

On Fri, May 20, 2016 at 1:50 PM, 王晓龙/0515 
wrote:

> I'm using a Spark Streaming program to store log messages into Parquet files
> every 10 mins.
> Now, when I query the Parquet data, it usually takes hundreds of thousands of
> stages to compute a single count.
> I looked into the Parquet files' path and found a great number of small
> files.
>
> Did the small files cause the problem? Can I merge them, or is there a
> better way to solve it?
>
> Lots of thanks.
>
> 
>
> This email represents only the personal views and opinions of the sender and not those of China Merchants Bank Co., Ltd. or its branches; China Merchants Bank Co., Ltd. and its branches accept no responsibility for its content. It is intended solely for the recipient; if you have received it in error, please delete it immediately.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
Try using the Hadoop setting mapreduce.input.fileinputformat.split.maxsize to
control the RDD partition size.
I have heard that the DataFrame reader can read several files in one task.
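
A small sketch of how that setting might be applied (the size value and input path are placeholders; the value is in bytes):

// Cap the maximum input split size at 256 MB (arbitrary example value).
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (256L * 1024 * 1024).toString)

// Reads planned after this point use the new split size when building partitions.
val rdd = sc.textFile("hdfs:///path/to/input")  // placeholder path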


On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 
wrote:

> I'm using a Spark Streaming program to store log messages into Parquet files
> every 10 mins.
> Now, when I query the Parquet data, it usually takes hundreds of thousands of
> stages to compute a single count.
> I looked into the Parquet files' path and found a great number of small
> files.
>
> Did the small files cause the problem? Can I merge them, or is there a
> better way to solve it?
>
> Lots of thanks.
>
> 
>
> This email represents only the personal views and opinions of the sender and not those of China Merchants Bank Co., Ltd. or its branches; China Merchants Bank Co., Ltd. and its branches accept no responsibility for its content. It is intended solely for the recipient; if you have received it in error, please delete it immediately.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Is there a way to merge parquet small files?

2016-05-19 Thread 王晓龙/01111515
I'm using a Spark Streaming program to store log messages into Parquet files
every 10 mins.
Now, when I query the Parquet data, it usually takes hundreds of thousands of
stages to compute a single count.
I looked into the Parquet files' path and found a great number of small files.

Did the small files cause the problem? Can I merge them, or is there a better
way to solve it?

Lots of thanks.


This email represents only the personal views and opinions of the sender and not those of China Merchants Bank Co., Ltd. or its branches; China Merchants Bank Co., Ltd. and its branches accept no responsibility for its content. It is intended solely for the recipient; if you have received it in error, please delete it immediately.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Query about how to estimate cpu usage for spark

2016-05-19 Thread Wang Jiaye
For an MR job, there is a job counter that provides CPU time in milliseconds,
but I cannot find a similar metric in Spark, which would be quite useful.
Does anyone know about this?


Re: Does spark support Apache Arrow

2016-05-19 Thread Hyukjin Kwon
FYI, there is a JIRA for this,
https://issues.apache.org/jira/browse/SPARK-13534

I hope this link is helpful.

Thanks!


2016-05-20 11:18 GMT+09:00 Sun Rui :

> 1. I don’t think so
> 2. Arrow is for in-memory columnar execution. While cache is for in-memory
> columnar storage
>
> On May 20, 2016, at 10:16, Todd  wrote:
>
> From the official site http://arrow.apache.org/, Apache Arrow is used for
> Columnar In-Memory storage. I have two quick questions:
> 1. Does spark support Apache Arrow?
> 2. When dataframe is cached in memory, the data are saved in columnar
> in-memory style. What is the relationship between this feature and Apache
> Arrow,that is,
> when the data is in Apache Arrow format,does spark still need the effort
> to cache the dataframe in columnar in-memory?
>
>
>


Re: Does spark support Apache Arrow

2016-05-19 Thread Sun Rui
1. I don't think so.
2. Arrow is for in-memory columnar execution, while cache is for in-memory
columnar storage.
> On May 20, 2016, at 10:16, Todd  wrote:
> 
> From the official site http://arrow.apache.org/, Apache Arrow is used for 
> Columnar In-Memory storage. I have two quick questions:
> 1. Does spark support Apache Arrow?
> 2. When dataframe is cached in memory, the data are saved in columnar 
> in-memory style. What is the relationship between this feature and Apache 
> Arrow,that is,
> when the data is in Apache Arrow format,does spark still need the effort to 
> cache the dataframe in columnar in-memory?



Does spark support Apache Arrow

2016-05-19 Thread Todd
From the official site http://arrow.apache.org/, Apache Arrow is used for
columnar in-memory storage. I have two quick questions:
1. Does Spark support Apache Arrow?
2. When a DataFrame is cached in memory, the data is saved in a columnar
in-memory format. What is the relationship between this feature and Apache
Arrow? That is, when the data is already in Apache Arrow format, does Spark
still need to cache the DataFrame in columnar in-memory form?

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
Sure. You can try pySpark, which is the Python API of Spark.
> On May 20, 2016, at 06:20, ayan guha  wrote:
> 
> Hi
> 
> Thanks for the input. Can it be possible to write it in python? I think I can 
> use FileUti.untar from hdfs jar. But can I do it from python?
> 
> On 19 May 2016 16:57, "Sun Rui"  > wrote:
> 1. create a temp dir on HDFS, say “/tmp”
> 2. write a script to create in the temp dir one file for each tar file. Each 
> file has only one line:
> 
> 3. Write a spark application. It is like:
>   val rdd = sc.textFile ()
>   rdd.map { line =>
>construct an untar command using the path information in “line” and 
> launches the command
>   }
> 
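
A runnable Scala sketch of step 3 above, assuming the list file holds one tar-file path per line, the tar files live on a filesystem every executor can read (e.g. an NFS mount), and the tar command is available on the worker nodes:

import scala.sys.process._

// Placeholder paths; one tar-file path per line in the list file.
val listFile  = "hdfs:///tmp/tar-list.txt"
val outputDir = "/shared/untarred"

val exitCodes = sc.textFile(listFile).map { path =>
  // Runs on the executor; assumes `path` is readable there.
  Seq("tar", "-xf", path, "-C", outputDir).!
}.collect()  // collect() forces the commands to actually run

require(exitCodes.forall(_ == 0), "some tar commands failed")

The same pattern works from pySpark by shelling out with subprocess inside a map function.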
> > On May 19, 2016, at 14:42, ayan guha  > > wrote:
> >
> > Hi
> >
> > I have few tar files in HDFS in a single folder. each file has multiple 
> > files in it.
> >
> > tar1:
> >   - f1.txt
> >   - f2.txt
> > tar2:
> >   - f1.txt
> >   - f2.txt
> >
> > (each tar file will have exact same number of files, same name)
> >
> > I am trying to find a way (spark or pig) to extract them to their own 
> > folders.
> >
> > f1
> >   - tar1_f1.txt
> >   - tar2_f1.txt
> > f2:
> >- tar1_f2.txt
> >- tar1_f2.txt
> >
> > Any help?
> >
> >
> >
> > --
> > Best Regards,
> > Ayan Guha
> 
> 



Re: dataframe stat corr for multiple columns

2016-05-19 Thread Sun Rui
There is an existing JIRA issue for it: 
https://issues.apache.org/jira/browse/SPARK-11057 

Also there is a PR. Maybe we should help to review and merge it with a higher
priority.
> On May 20, 2016, at 00:09, Xiangrui Meng  wrote:
> 
> This is nice to have. Please create a JIRA for it. Right now, you can merge 
> all columns into a vector column using RFormula or VectorAssembler, then 
> convert it into an RDD and call corr from MLlib.
> 
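
A sketch of the workaround described above, using the Spark 1.6 APIs (the DataFrame df and the column names are placeholders):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics

// Assemble the numeric columns to correlate into a single vector column.
val numericCols = Array("colA", "colB", "colC")  // placeholder column names
val assembled = new VectorAssembler()
  .setInputCols(numericCols)
  .setOutputCol("features")
  .transform(df)

// Convert to an RDD[Vector] and compute the full Pearson correlation matrix.
val vectors = assembled.select("features").rdd.map(_.getAs[Vector]("features"))
val corrMatrix = Statistics.corr(vectors, "pearson")
println(corrMatrix)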
> 
> On Tue, May 17, 2016, 7:09 AM Ankur Jain  > wrote:
> Hello Team,
> 
>  
> 
> In my current usecase I am loading data from CSV using spark-csv and trying 
> to correlate all variables.
> 
>  
> 
> As of now if we want to correlate 2 column in a dataframe df.stat.corr works 
> great but if we want to correlate multiple columns this won’t work.
> 
> In case of R we can use corrplot and correlate all numeric columns in a 
> single line of code. Can you guide me how to achieve the same with dataframe 
> or sql?
> 
>  
> 
> There seems a way in spark-mllib
> 
> http://spark.apache.org/docs/latest/mllib-statistics.html 
> 
>  
> 
> 
> 
>  
> 
> But it seems that it don’t take input as dataframe…
> 
>  
> 
> Regards,
> 
> Ankur
> 
> Information transmitted by this e-mail is proprietary to YASH Technologies 
> and/ or its Customers and is intended for use only by the individual or 
> entity to which it is addressed, and may contain information that is 
> privileged, confidential or exempt from disclosure under applicable law. If 
> you are not the intended recipient or it appears that this mail has been 
> forwarded to you without proper authority, you are notified that any use or 
> dissemination of this information in any manner is strictly prohibited. In 
> such cases, please notify us immediately at i...@yash.com 
>  and delete this mail from your records.



Re: Starting executor without a master

2016-05-19 Thread Marcelo Vanzin
On Thu, May 19, 2016 at 6:06 PM, Mathieu Longtin  wrote:
> I'm looking to bypass the master entirely. I manage the workers outside of
> Spark. So I want to start the driver, the start workers that connect
> directly to the driver.

It should be possible to do that if you extend the interface I
mentioned. I didn't mean "master" the daemon process, I meant the
github branch of Spark.


-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
I'm looking to bypass the master entirely. I manage the workers outside of
Spark. So I want to start the driver, then start workers that connect
directly to the driver.

Anyway, it looks like I will have to live with our current solution for a
while.

On Thu, May 19, 2016 at 8:32 PM Marcelo Vanzin  wrote:

> Hi Mathieu,
>
> There's nothing like that in Spark currently. For that, you'd need a
> new cluster manager implementation that knows how to start executors
> in those remote machines (e.g. by running ssh or something).
>
> In the current master there's an interface you can implement to try
> that if you really want to (ExternalClusterManager), but it's
> currently "private[spark]" and it probably wouldn't be a very simple
> task.
>
>
> On Thu, May 19, 2016 at 10:45 AM, Mathieu Longtin
>  wrote:
> > First a bit of context:
> > We use Spark on a platform where each user start workers as needed. This
> has
> > the advantage that all permission management is handled by the OS, so the
> > users can only read files they have permission to.
> >
> > To do this, we have some utility that does the following:
> > - start a master
> > - start worker managers on a number of servers
> > - "submit" the Spark driver program
> > - the driver then talks to the master, tell it how many executors it
> needs
> > - the master tell the worker nodes to start executors and talk to the
> driver
> > - the executors are started
> >
> > From here on, the master doesn't do much, neither do the process manager
> on
> > the worker nodes.
> >
> > What I would like to do is simplify this to:
> > - Start the driver program
> > - Start executors on a number of servers, telling them where to find the
> > driver
> > - The executors connect directly to the driver
> >
> > Is there a way I could do this without the master and worker managers?
> >
> > Thanks!
> >
> >
> > --
> > Mathieu Longtin
> > 1-514-803-8977
>
>
>
> --
> Marcelo
>
-- 
Mathieu Longtin
1-514-803-8977


Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Okay:
host=my.local.server
port=someport

This is the spark-submit command, which runs on my local server:
$SPARK_HOME/bin/spark-submit --master spark://$host:$port --executor-memory 4g python-script.py with args

If I want 200 worker cores, I tell the cluster scheduler to run this
command on 200 cores:
$SPARK_HOME/sbin/start-slave.sh --cores=1 --memory=4g spark://$host:$port

That's it. When the task starts, it uses all available workers. If for some
reason, not enough cores are available immediately, it still starts
processing with whatever it gets and the load will be spread further as
workers come online.


On Thu, May 19, 2016 at 8:24 PM Mich Talebzadeh 
wrote:

> In a normal operation we tell spark which node the worker processes can
> run by adding the nodenames to conf/slaves.
>
> Not very clear on this in your case all the jobs run locally with say 100
> executor cores like below:
>
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[*] \
>
> --driver-memory xg \  --default would be 512M
>
> --num-executors=1 \   -- This is the constraint in
> stand-alone Spark cluster, whether specified or not
>
> --executor-memory=xG \ --
>
> --executor-cores=n \
>
> --master local[*] means all cores and --executor-cores in your case need
> not be specified? or you can cap it like above --executor-cores=n. If it
> is not specified then the Spark app will go and grab every core. Although
> in practice that does not happen it is just an upper ceiling. It is FIFO.
>
> What typical executor memory is specified in your case?
>
> Do you have a  sample snapshot of spark-submit job by any chance Mathieu?
>
> Cheers
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 May 2016 at 00:27, Mathieu Longtin  wrote:
>
>> Mostly, the resource management is not up to the Spark master.
>>
>> We routinely start 100 executor-cores for 5 minute job, and they just
>> quit when they are done. Then those processor cores can do something else
>> entirely, they are not reserved for Spark at all.
>>
>> On Thu, May 19, 2016 at 4:55 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Then in theory every user can fire multiple spark-submit jobs. do you
>>> cap it with settings in  $SPARK_HOME/conf/spark-defaults.conf , but I
>>> guess in reality every user submits one job only.
>>>
>>> This is an interesting model for two reasons:
>>>
>>>
>>>- It uses parallel processing across all the nodes or most of the
>>>nodes to minimise the processing time
>>>- it requires less intervention
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 21:33, Mathieu Longtin  wrote:
>>>
 Driver memory is default. Executor memory depends on job, the caller
 decides how much memory to use. We don't specify --num-executors as we want
 all cores assigned to the local master, since they were started by the
 current user. No local executor.  --master=spark://localhost:someport. 1
 core per executor.

 On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Thanks Mathieu
>
> So it would be interesting to see what resources allocated in your
> case, especially the num-executors and executor-cores. I gather every node
> has enough memory and cores.
>
>
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[2] \
>
> --driver-memory 4g \
>
> --num-executors=1 \
>
> --executor-memory=4G \
>
> --executor-cores=2 \
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 21:02, Mathieu Longtin 
> wrote:
>
>> The driver (the process started by spark-submit) runs locally. The
>> executors run on any of thousands of servers. So far, I haven't tried 
>> more
>> than 500 executors.
>>
>> Right now, I run a master on the same server as the driver.
>>
>> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:

Re: Starting executor without a master

2016-05-19 Thread Marcelo Vanzin
Hi Mathieu,

There's nothing like that in Spark currently. For that, you'd need a
new cluster manager implementation that knows how to start executors
in those remote machines (e.g. by running ssh or something).

In the current master there's an interface you can implement to try
that if you really want to (ExternalClusterManager), but it's
currently "private[spark]" and it probably wouldn't be a very simple
task.


On Thu, May 19, 2016 at 10:45 AM, Mathieu Longtin
 wrote:
> First a bit of context:
> We use Spark on a platform where each user start workers as needed. This has
> the advantage that all permission management is handled by the OS, so the
> users can only read files they have permission to.
>
> To do this, we have some utility that does the following:
> - start a master
> - start worker managers on a number of servers
> - "submit" the Spark driver program
> - the driver then talks to the master, tell it how many executors it needs
> - the master tell the worker nodes to start executors and talk to the driver
> - the executors are started
>
> From here on, the master doesn't do much, neither do the process manager on
> the worker nodes.
>
> What I would like to do is simplify this to:
> - Start the driver program
> - Start executors on a number of servers, telling them where to find the
> driver
> - The executors connect directly to the driver
>
> Is there a way I could do this without the master and worker managers?
>
> Thanks!
>
>
> --
> Mathieu Longtin
> 1-514-803-8977



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: KafkaUtils.createDirectStream Not Fetching Messages with Confluent Serializers as Value Decoder.

2016-05-19 Thread Mail.com
Hi Muthu,

Do you have Kerberos enabled?

Thanks,
Pradeep

> On May 19, 2016, at 12:17 AM, Ramaswamy, Muthuraman 
>  wrote:
> 
> I am using Spark 1.6.1 and Kafka 0.9+ It works for both receiver and 
> receiver-less mode.
> 
> One thing I noticed when you specify invalid topic name, KafkaUtils doesn't 
> fetch any messages. So, check you have specified the topic name correctly.
> 
> ~Muthu
> 
> From: Mail.com [pradeep.mi...@mail.com]
> Sent: Monday, May 16, 2016 9:33 PM
> To: Ramaswamy, Muthuraman
> Cc: Cody Koeninger; spark users
> Subject: Re: KafkaUtils.createDirectStream Not Fetching Messages with 
> Confluent Serializers as Value Decoder.
> 
> Hi Muthu,
> 
> Are you on spark 1.4.1 and Kafka 0.8.2? I have a similar issue even for 
> simple string messages.
> 
> Console producer and consumer work fine. But spark always reruns empty RDD. I 
> am using Receiver based Approach.
> 
> Thanks,
> Pradeep
> 
>> On May 16, 2016, at 8:19 PM, Ramaswamy, Muthuraman 
>>  wrote:
>> 
>> Yes, I can see the messages. Also, I wrote a quick custom decoder for avro 
>> and it works fine for the following:
>> 
 kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": 
 brokers}, valueDecoder=decoder)
>> 
>> But, when I use the Confluent Serializers to leverage the Schema Registry 
>> (based on the link shown below), it doesn’t work for me. I am not sure 
>> whether I need to configure any more details to consume the Schema Registry. 
>> I can fetch the schema from the schema registry based on is Ids. The decoder 
>> method is not returning any values for me.
>> 
>> ~Muthu
>> 
>> 
>> 
>>> On 5/16/16, 10:49 AM, "Cody Koeninger"  wrote:
>>> 
>>> Have you checked to make sure you can receive messages just using a
>>> byte array for value?
>>> 
>>> On Mon, May 16, 2016 at 12:33 PM, Ramaswamy, Muthuraman
>>>  wrote:
 I am trying to consume AVRO formatted message through
 KafkaUtils.createDirectStream. I followed the listed below example (refer
 link) but the messages are not being fetched by the Stream.
 
 https://urldefense.proofpoint.com/v2/url?u=http-3A__stackoverflow.com_questions_30339636_spark-2Dpython-2Davro-2Dkafka-2Ddeserialiser=CwIBaQ=jcv3orpCsv7C4ly8-ubDob57ycZ4jvhoYZNDBA06fPk=NQ-dw5X8CJcqaXIvIdMUUdkL0fHDonD07FZzTY3CgiU=Nc-rPMFydyCrwOZuNWs2GmSL4NkN8eGoR-mkJUlkCx0=hwqxCKl3P4_9pKWeo1OGR134QegMRe3Xh22_WMy-5q8=
 
 Is there any code missing that I must add to make the above sample work.
 Say, I am not sure how the confluent serializers would know the avro schema
 info as it knows only the Schema Registry URL info.
 
 Appreciate your help.
 
 ~Muthu
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Filter out the elements from xml file in Spark

2016-05-19 Thread Mail.com
Hi Yogesh,

Can you try a map operation to get what you need, with whatever parser you are using?
You could also look at the spark-xml package.
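
A small sketch of the spark-xml route (Spark 1.x syntax; the row tag, paths and filter column are placeholders, and the package has to be added, e.g. with --packages com.databricks:spark-xml_2.10:<version>):

// Read XML records into a DataFrame; "record" is a placeholder row tag.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("hdfs:///path/to/input/*.xml")   // placeholder path

// Filter on an element of interest ("level" is a placeholder column).
val errors = df.filter(df("level") === "ERROR")
errors.show()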

Thanks,
Pradeep
> On May 19, 2016, at 4:39 AM, Yogesh Vyas  wrote:
> 
> Hi,
> I had XML files which I am reading through textFileStream, and then
> filtering out the required elements using traditional conditions and
> loops. I would like to know if there are any specific packages or
> functions provided in Spark to perform operations on an RDD of XML?
> 
> Regards,
> Yogesh
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
In normal operation we tell Spark which nodes the worker processes can run on
by adding the node names to conf/slaves.

I am not very clear on this. In your case, do all the jobs run locally with,
say, 100 executor cores like below?


${SPARK_HOME}/bin/spark-submit \

--master local[*] \

--driver-memory xg \  --default would be 512M

--num-executors=1 \   -- This is the constraint in
stand-alone Spark cluster, whether specified or not

--executor-memory=xG \ --

--executor-cores=n \

--master local[*] means all cores and --executor-cores in your case need
not be specified? or you can cap it like above --executor-cores=n. If it is
not specified then the Spark app will go and grab every core. Although in
practice that does not happen it is just an upper ceiling. It is FIFO.

What typical executor memory is specified in your case?

Do you have a  sample snapshot of spark-submit job by any chance Mathieu?

Cheers


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 20 May 2016 at 00:27, Mathieu Longtin  wrote:

> Mostly, the resource management is not up to the Spark master.
>
> We routinely start 100 executor-cores for 5 minute job, and they just quit
> when they are done. Then those processor cores can do something else
> entirely, they are not reserved for Spark at all.
>
> On Thu, May 19, 2016 at 4:55 PM Mich Talebzadeh 
> wrote:
>
>> Then in theory every user can fire multiple spark-submit jobs. do you cap
>> it with settings in  $SPARK_HOME/conf/spark-defaults.conf , but I guess
>> in reality every user submits one job only.
>>
>> This is an interesting model for two reasons:
>>
>>
>>- It uses parallel processing across all the nodes or most of the
>>nodes to minimise the processing time
>>- it requires less intervention
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 May 2016 at 21:33, Mathieu Longtin  wrote:
>>
>>> Driver memory is default. Executor memory depends on job, the caller
>>> decides how much memory to use. We don't specify --num-executors as we want
>>> all cores assigned to the local master, since they were started by the
>>> current user. No local executor.  --master=spark://localhost:someport. 1
>>> core per executor.
>>>
>>> On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thanks Mathieu

 So it would be interesting to see what resources allocated in your
 case, especially the num-executors and executor-cores. I gather every node
 has enough memory and cores.



 ${SPARK_HOME}/bin/spark-submit \

 --master local[2] \

 --driver-memory 4g \

 --num-executors=1 \

 --executor-memory=4G \

 --executor-cores=2 \

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 19 May 2016 at 21:02, Mathieu Longtin 
 wrote:

> The driver (the process started by spark-submit) runs locally. The
> executors run on any of thousands of servers. So far, I haven't tried more
> than 500 executors.
>
> Right now, I run a master on the same server as the driver.
>
> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> ok so you are using some form of NFS mounted file system shared among
>> the nodes and basically you start the processes through spark-submit.
>>
>> In Stand-alone mode, a simple cluster manager included with Spark. It
>> does the management of resources so it is not clear to me what you are
>> referring as worker manager here?
>>
>> This is my take from your model.
>>  The application will go and grab all the cores in the cluster.
>> You only have one worker that lives within the driver JVM process.
>> The Driver node runs on the same host that the cluster manager is
>> running. The Driver requests the Cluster Manager for resources to run
>> tasks. In this case there is only one executor for the Driver? The 
>> Executor
>> runs tasks for the Driver.
>>
>>
>> HTH
>>
>> Dr Mich 

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Mostly, the resource management is not up to the Spark master.

We routinely start 100 executor-cores for a 5-minute job, and they just quit
when they are done. Then those processor cores can do something else
entirely; they are not reserved for Spark at all.

On Thu, May 19, 2016 at 4:55 PM Mich Talebzadeh 
wrote:

> Then in theory every user can fire multiple spark-submit jobs. do you cap
> it with settings in  $SPARK_HOME/conf/spark-defaults.conf , but I guess
> in reality every user submits one job only.
>
> This is an interesting model for two reasons:
>
>
>- It uses parallel processing across all the nodes or most of the
>nodes to minimise the processing time
>- it requires less intervention
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 21:33, Mathieu Longtin  wrote:
>
>> Driver memory is default. Executor memory depends on job, the caller
>> decides how much memory to use. We don't specify --num-executors as we want
>> all cores assigned to the local master, since they were started by the
>> current user. No local executor.  --master=spark://localhost:someport. 1
>> core per executor.
>>
>> On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Thanks Mathieu
>>>
>>> So it would be interesting to see what resources allocated in your case,
>>> especially the num-executors and executor-cores. I gather every node has
>>> enough memory and cores.
>>>
>>>
>>>
>>> ${SPARK_HOME}/bin/spark-submit \
>>>
>>> --master local[2] \
>>>
>>> --driver-memory 4g \
>>>
>>> --num-executors=1 \
>>>
>>> --executor-memory=4G \
>>>
>>> --executor-cores=2 \
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 21:02, Mathieu Longtin  wrote:
>>>
 The driver (the process started by spark-submit) runs locally. The
 executors run on any of thousands of servers. So far, I haven't tried more
 than 500 executors.

 Right now, I run a master on the same server as the driver.

 On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> ok so you are using some form of NFS mounted file system shared among
> the nodes and basically you start the processes through spark-submit.
>
> In Stand-alone mode, a simple cluster manager included with Spark. It
> does the management of resources so it is not clear to me what you are
> referring as worker manager here?
>
> This is my take from your model.
>  The application will go and grab all the cores in the cluster.
> You only have one worker that lives within the driver JVM process.
> The Driver node runs on the same host that the cluster manager is
> running. The Driver requests the Cluster Manager for resources to run
> tasks. In this case there is only one executor for the Driver? The 
> Executor
> runs tasks for the Driver.
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 20:37, Mathieu Longtin 
> wrote:
>
>> No master and no node manager, just the processes that do actual work.
>>
>> We use the "stand alone" version because we have a shared file system
>> and a way of allocating computing resources already (Univa Grid Engine). 
>> If
>> an executor were to die, we have other ways of restarting it, we don't 
>> need
>> the worker manager to deal with it.
>>
>> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Mathieu
>>>
>>> What does this approach provide that the norm lacks?
>>>
>>> So basically each node has its master in this model.
>>>
>>> Are these supposed to be individual stand alone servers?
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>

Re: Tar File: On Spark

2016-05-19 Thread Ted Yu
See http://memect.co/call-java-from-python-so

You can also use Py4J

On Thu, May 19, 2016 at 3:20 PM, ayan guha  wrote:

> Hi
>
> Thanks for the input. Can it be possible to write it in python? I think I
> can use FileUti.untar from hdfs jar. But can I do it from python?
> On 19 May 2016 16:57, "Sun Rui"  wrote:
>
>> 1. create a temp dir on HDFS, say “/tmp”
>> 2. write a script to create in the temp dir one file for each tar file.
>> Each file has only one line:
>> 
>> 3. Write a spark application. It is like:
>>   val rdd = sc.textFile ()
>>   rdd.map { line =>
>>construct an untar command using the path information in “line”
>> and launches the command
>>   }
>>
>> > On May 19, 2016, at 14:42, ayan guha  wrote:
>> >
>> > Hi
>> >
>> > I have few tar files in HDFS in a single folder. each file has multiple
>> files in it.
>> >
>> > tar1:
>> >   - f1.txt
>> >   - f2.txt
>> > tar2:
>> >   - f1.txt
>> >   - f2.txt
>> >
>> > (each tar file will have exact same number of files, same name)
>> >
>> > I am trying to find a way (spark or pig) to extract them to their own
>> folders.
>> >
>> > f1
>> >   - tar1_f1.txt
>> >   - tar2_f1.txt
>> > f2:
>> >- tar1_f2.txt
>> >- tar1_f2.txt
>> >
>> > Any help?
>> >
>> >
>> >
>> > --
>> > Best Regards,
>> > Ayan Guha
>>
>>
>>


Re: How to perform reduce operation in the same order as partition indexes

2016-05-19 Thread ayan guha
You can add the index from mapPartitionsWithIndex to the output and order
based on that in the merge step.
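
A minimal sketch of that idea, reusing the names from the question below (computeBlock is a stand-in for the question's calculateBCInternal):

import org.apache.spark.rdd.RDD

// Stand-in for calculateBCInternal: any per-partition computation that turns
// the partition's rows into a block of matrix rows.
def computeBlock(iter: Iterator[Array[Short]]): Array[Array[Double]] =
  iter.map(_.map(_.toDouble)).toArray

// Keep the partition index alongside each partition's result.
val perPartition: RDD[(Int, Array[Array[Double]])] =
  shortrddFinal.mapPartitionsWithIndex { (idx, iter) =>
    Iterator((idx, computeBlock(iter)))
  }

// Merge on the driver in ascending partition order so row blocks keep their order.
val merged: Array[Array[Double]] =
  perPartition.collect().sortBy(_._1).flatMap(_._2)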
On 19 May 2016 13:22, "Pulasthi Supun Wickramasinghe" 
wrote:

> Hi Devs/All,
>
> I am pretty new to Spark. I have a program which does some map reduce
> operations with matrices. Here *shortrddFinal* is of type
> *RDD[Array[Short]]* and consists of several partitions
>
> *var BC =
> shortrddFinal.mapPartitionsWithIndex(calculateBCInternal).reduce(mergeBC)*
>
> The map function produces a "Array[Array[Double]]" and at the reduce step
> i need to merge all the 2 dimensional double arrays produced for each
> partition into one big matrix. But i also need to keep the same order as
> the partitions. that is the 2D double array produced for partition 0 should
> be the first set of rows in the matrix and then the 2d double array
> produced for partition 1 and so on. Is there a way to enforce the order in
> the reduce step.
>
> Thanks in advance
>
> Best Regards,
> Pulasthi
> --
> Pulasthi S. Wickramasinghe
> Graduate Student  | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> cell: 224-386-9035
>


Re: Tar File: On Spark

2016-05-19 Thread ayan guha
Hi

Thanks for the input. Would it be possible to write it in Python? I think I
can use FileUtil.unTar from the HDFS jar, but can I do it from Python?
On 19 May 2016 16:57, "Sun Rui"  wrote:

> 1. create a temp dir on HDFS, say “/tmp”
> 2. write a script to create in the temp dir one file for each tar file.
> Each file has only one line:
> 
> 3. Write a spark application. It is like:
>   val rdd = sc.textFile ()
>   rdd.map { line =>
>construct an untar command using the path information in “line” and
> launches the command
>   }
>
> > On May 19, 2016, at 14:42, ayan guha  wrote:
> >
> > Hi
> >
> > I have few tar files in HDFS in a single folder. each file has multiple
> files in it.
> >
> > tar1:
> >   - f1.txt
> >   - f2.txt
> > tar2:
> >   - f1.txt
> >   - f2.txt
> >
> > (each tar file will have exact same number of files, same name)
> >
> > I am trying to find a way (spark or pig) to extract them to their own
> folders.
> >
> > f1
> >   - tar1_f1.txt
> >   - tar2_f1.txt
> > f2:
> >- tar1_f2.txt
> >- tar1_f2.txt
> >
> > Any help?
> >
> >
> >
> > --
> > Best Regards,
> > Ayan Guha
>
>
>


Re: Couldn't find leader offsets

2016-05-19 Thread Colin Hall
Hey Cody, thanks for the response. I looked at connection as a possibility 
based on your advice and after a lot of digging found a couple of things 
mentioned on SO and kafka lists about name resolution causing issues.   I 
created an entry in /etc/hosts on the spark host to resolve the broker to its 
IP and that seemed to do the trick.

Thanks much.
ch.



> On May 19, 2016, at 8:19 AM, Cody Koeninger  wrote:
> 
> Looks like a networking issue to me.  Make sure you can connect to the
> broker on the specified host and port from the spark driver (and the
> executors too, for that matter)
> 
> On Wed, May 18, 2016 at 4:04 PM, samsayiam  wrote:
>> I have seen questions posted about this on SO and on this list but haven't
>> seen a response that addresses my issue.  I am trying to create a direct
>> stream connection to a kafka topic but it fails with Couldn't find leader
>> offsets for Set(...).  If I run a kafka consumer I can read the topic but
>> can't do it with spark.  Can someone tell me where I'm going wrong here?
>> 
>> Test topic info:
>> vagrant@broker1$ ./bin/kafka-topics.sh --describe --zookeeper 10.30.3.2:2181
>> --topic footopic
>> Topic:footopic  PartitionCount:1ReplicationFactor:1 Configs:
>>Topic: footopic Partition: 0Leader: 0   Replicas: 0   
>> Isr: 0
>> 
>> consuming from kafka:
>> vagrant@broker1$ bin/kafka-console-consumer.sh --zookeeper 10.30.3.2:2181
>> --from-beginning --topic footopic
>> this is a test
>> and so is this
>> goodbye
>> 
>> Attempting from spark:
>> spark-submit --class com.foo.Experiment --master local[*] --jars
>> /vagrant/spark-streaming-kafka-assembly_2.10-1.6.1.jar
>> /vagrant/spark-app-1.0-SNAPSHOT.jar 10.0.7.34:9092
>> 
>> ...
>> 
>> Using kafkaparams: {auto.offset.reset=smallest,
>> metadata.broker.list=10.0.7.34:9092}
>> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Verifying properties
>> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Property
>> auto.offset.reset is overridden to smallest
>> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Property group.id is
>> overridden to
>> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Property
>> zookeeper.connect is overridden to
>> 16/05/18 20:27:21 INFO consumer.SimpleConsumer: Reconnect due to socket
>> error: java.nio.channels.ClosedChannelException
>> Exception in thread "main" org.apache.spark.SparkException:
>> java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([footopic,0])
>> ...
>> 
>> 
>> Any help is appreciated.
>> 
>> Thanks,
>> ch.
>> 
>> 
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Couldn-t-find-leader-offsets-tp26978.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Splitting RDD by partition

2016-05-19 Thread shlomi
Hey Sparkers,

I have a workflow where I have to ensure certain keys are always in the same
RDD partition (it's a mandatory algorithmic invariant). I can easily achieve
this by having a custom partitioner.

This results in a single RDD that requires further computation. However,
there are currently two completely different computations for different
partitions. Some partitions are done at this step and could already be
written to disk, while the rest still need a few more
map/filter/shuffle/etc. steps to complete.

The simplest idea I have for this is to have some way to split the RDD
into multiple RDDs based on partition numbers (which I know from my custom
partitioner).

I have managed to achieve this like so (splitting to only 2 RDDs):
https://gist.github.com/vadali/3e5f832e4a6cb320e50b67dd05b3e97c
-
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Split an RDD according to its partition number: partitions [0, hotPartitions)
// go to the first RDD, the rest go to the second.
def splitByPartition[T: ClassTag](rdd: RDD[T], hotPartitions: Int): (RDD[T], RDD[T]) = {
  val splits = rdd.mapPartitions { iter =>
    val partId = TaskContext.get.partitionId
    val left  = if (partId <  hotPartitions) iter else Iterator.empty
    val right = if (partId >= hotPartitions) iter else Iterator.empty
    Seq(left, right).iterator
  }

  // Each partition of `splits` holds two iterators: the first feeds the left
  // RDD, the second feeds the right RDD.
  val left  = splits.mapPartitions { iter => iter.next().toIterator }
  val right = splits.mapPartitions { iter =>
    iter.next()
    iter.next().toIterator
  }
  (left, right)
}
-

Is this the best way? It seems to cause some shuffling; however, I am not
sure how that impacts performance.

Is there another way, maybe even a more involved way, to achieve this? 

Thanks!
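
One possibly simpler variant of the same idea, offered only as a sketch: build each side with mapPartitionsWithIndex and drop the partitions that belong to the other side. It scans the parent RDD twice, so caching the parent first may be worthwhile.

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

def splitByPartitionIndex[T: ClassTag](rdd: RDD[T], hotPartitions: Int): (RDD[T], RDD[T]) = {
  // preservesPartitioning keeps the custom partitioner's layout intact.
  val hot = rdd.mapPartitionsWithIndex(
    (idx, iter) => if (idx < hotPartitions) iter else Iterator.empty,
    preservesPartitioning = true)
  val rest = rdd.mapPartitionsWithIndex(
    (idx, iter) => if (idx >= hotPartitions) iter else Iterator.empty,
    preservesPartitioning = true)
  (hot, rest)
}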



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-by-partition-tp26983.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
Then in theory every user can fire multiple spark-submit jobs. Do you cap
it with settings in $SPARK_HOME/conf/spark-defaults.conf? I guess in
reality every user submits one job only.

This is an interesting model for two reasons:


   - It uses parallel processing across all the nodes or most of the nodes
   to minimise the processing time
   - it requires less intervention



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 19 May 2016 at 21:33, Mathieu Longtin  wrote:

> Driver memory is default. Executor memory depends on job, the caller
> decides how much memory to use. We don't specify --num-executors as we want
> all cores assigned to the local master, since they were started by the
> current user. No local executor.  --master=spark://localhost:someport. 1
> core per executor.
>
> On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh 
> wrote:
>
>> Thanks Mathieu
>>
>> So it would be interesting to see what resources allocated in your case,
>> especially the num-executors and executor-cores. I gather every node has
>> enough memory and cores.
>>
>>
>>
>> ${SPARK_HOME}/bin/spark-submit \
>>
>> --master local[2] \
>>
>> --driver-memory 4g \
>>
>> --num-executors=1 \
>>
>> --executor-memory=4G \
>>
>> --executor-cores=2 \
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 May 2016 at 21:02, Mathieu Longtin  wrote:
>>
>>> The driver (the process started by spark-submit) runs locally. The
>>> executors run on any of thousands of servers. So far, I haven't tried more
>>> than 500 executors.
>>>
>>> Right now, I run a master on the same server as the driver.
>>>
>>> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 ok so you are using some form of NFS mounted file system shared among
 the nodes and basically you start the processes through spark-submit.

 In Stand-alone mode, a simple cluster manager included with Spark. It
 does the management of resources so it is not clear to me what you are
 referring as worker manager here?

 This is my take from your model.
  The application will go and grab all the cores in the cluster.
 You only have one worker that lives within the driver JVM process.
 The Driver node runs on the same host that the cluster manager is
 running. The Driver requests the Cluster Manager for resources to run
 tasks. In this case there is only one executor for the Driver? The Executor
 runs tasks for the Driver.


 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 19 May 2016 at 20:37, Mathieu Longtin 
 wrote:

> No master and no node manager, just the processes that do actual work.
>
> We use the "stand alone" version because we have a shared file system
> and a way of allocating computing resources already (Univa Grid Engine). 
> If
> an executor were to die, we have other ways of restarting it, we don't 
> need
> the worker manager to deal with it.
>
> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi Mathieu
>>
>> What does this approach provide that the norm lacks?
>>
>> So basically each node has its master in this model.
>>
>> Are these supposed to be individual stand alone servers?
>>
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 May 2016 at 18:45, Mathieu Longtin 
>> wrote:
>>
>>> First a bit of context:
>>> We use Spark on a platform where each user start workers as needed.
>>> This has the advantage that all permission management is handled by the 
>>> OS,
>>> so the users can only read files they have permission to.
>>>
>>> To do this, we have some utility that does the following:
>>> - 

Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
Driver memory is default. Executor memory depends on job, the caller
decides how much memory to use. We don't specify --num-executors as we want
all cores assigned to the local master, since they were started by the
current user. No local executor.  --master=spark://localhost:someport. 1
core per executor.

On Thu, May 19, 2016 at 4:12 PM Mich Talebzadeh 
wrote:

> Thanks Mathieu
>
> So it would be interesting to see what resources allocated in your case,
> especially the num-executors and executor-cores. I gather every node has
> enough memory and cores.
>
>
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --master local[2] \
>
> --driver-memory 4g \
>
> --num-executors=1 \
>
> --executor-memory=4G \
>
> --executor-cores=2 \
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 21:02, Mathieu Longtin  wrote:
>
>> The driver (the process started by spark-submit) runs locally. The
>> executors run on any of thousands of servers. So far, I haven't tried more
>> than 500 executors.
>>
>> Right now, I run a master on the same server as the driver.
>>
>> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> ok so you are using some form of NFS mounted file system shared among
>>> the nodes and basically you start the processes through spark-submit.
>>>
>>> In Stand-alone mode, a simple cluster manager included with Spark. It
>>> does the management of resources so it is not clear to me what you are
>>> referring as worker manager here?
>>>
>>> This is my take from your model.
>>>  The application will go and grab all the cores in the cluster.
>>> You only have one worker that lives within the driver JVM process.
>>> The Driver node runs on the same host that the cluster manager is
>>> running. The Driver requests the Cluster Manager for resources to run
>>> tasks. In this case there is only one executor for the Driver? The Executor
>>> runs tasks for the Driver.
>>>
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 20:37, Mathieu Longtin  wrote:
>>>
 No master and no node manager, just the processes that do actual work.

 We use the "stand alone" version because we have a shared file system
 and a way of allocating computing resources already (Univa Grid Engine). If
 an executor were to die, we have other ways of restarting it, we don't need
 the worker manager to deal with it.

 On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> Hi Mathieu
>
> What does this approach provide that the norm lacks?
>
> So basically each node has its master in this model.
>
> Are these supposed to be individual stand alone servers?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 18:45, Mathieu Longtin 
> wrote:
>
>> First a bit of context:
>> We use Spark on a platform where each user start workers as needed.
>> This has the advantage that all permission management is handled by the 
>> OS,
>> so the users can only read files they have permission to.
>>
>> To do this, we have some utility that does the following:
>> - start a master
>> - start worker managers on a number of servers
>> - "submit" the Spark driver program
>> - the driver then talks to the master, tell it how many executors it
>> needs
>> - the master tell the worker nodes to start executors and talk to the
>> driver
>> - the executors are started
>>
>> From here on, the master doesn't do much, neither do the process
>> manager on the worker nodes.
>>
>> What I would like to do is simplify this to:
>> - Start the driver program
>> - Start executors on a number of servers, telling them where to find
>> the driver
>> - The executors connect directly to the driver
>>
>> Is there a way I could do this without the master and worker managers?
>>
>> Thanks!
>>
>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977

Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
Thanks Mathieu

So it would be interesting to see what resources are allocated in your case,
especially the num-executors and executor-cores. I gather every node has
enough memory and cores.



${SPARK_HOME}/bin/spark-submit \

--master local[2] \

--driver-memory 4g \

--num-executors=1 \

--executor-memory=4G \

--executor-cores=2 \

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 19 May 2016 at 21:02, Mathieu Longtin  wrote:

> The driver (the process started by spark-submit) runs locally. The
> executors run on any of thousands of servers. So far, I haven't tried more
> than 500 executors.
>
> Right now, I run a master on the same server as the driver.
>
> On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh 
> wrote:
>
>> ok so you are using some form of NFS mounted file system shared among the
>> nodes and basically you start the processes through spark-submit.
>>
>> In Stand-alone mode, a simple cluster manager included with Spark. It
>> does the management of resources so it is not clear to me what you are
>> referring as worker manager here?
>>
>> This is my take from your model.
>>  The application will go and grab all the cores in the cluster.
>> You only have one worker that lives within the driver JVM process.
>> The Driver node runs on the same host that the cluster manager is
>> running. The Driver requests the Cluster Manager for resources to run
>> tasks. In this case there is only one executor for the Driver? The Executor
>> runs tasks for the Driver.
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 May 2016 at 20:37, Mathieu Longtin  wrote:
>>
>>> No master and no node manager, just the processes that do actual work.
>>>
>>> We use the "stand alone" version because we have a shared file system
>>> and a way of allocating computing resources already (Univa Grid Engine). If
>>> an executor were to die, we have other ways of restarting it, we don't need
>>> the worker manager to deal with it.
>>>
>>> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Hi Mathieu

 What does this approach provide that the norm lacks?

 So basically each node has its master in this model.

 Are these supposed to be individual stand alone servers?


 Thanks


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 19 May 2016 at 18:45, Mathieu Longtin 
 wrote:

> First a bit of context:
> We use Spark on a platform where each user start workers as needed.
> This has the advantage that all permission management is handled by the 
> OS,
> so the users can only read files they have permission to.
>
> To do this, we have some utility that does the following:
> - start a master
> - start worker managers on a number of servers
> - "submit" the Spark driver program
> - the driver then talks to the master, tell it how many executors it
> needs
> - the master tell the worker nodes to start executors and talk to the
> driver
> - the executors are started
>
> From here on, the master doesn't do much, neither do the process
> manager on the worker nodes.
>
> What I would like to do is simplify this to:
> - Start the driver program
> - Start executors on a number of servers, telling them where to find
> the driver
> - The executors connect directly to the driver
>
> Is there a way I could do this without the master and worker managers?
>
> Thanks!
>
>
> --
> Mathieu Longtin
> 1-514-803-8977
>

 --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>


Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
The driver (the process started by spark-submit) runs locally. The
executors run on any of thousands of servers. So far, I haven't tried more
than 500 executors.

Right now, I run a master on the same server as the driver.

On Thu, May 19, 2016 at 3:49 PM Mich Talebzadeh 
wrote:

> ok so you are using some form of NFS mounted file system shared among the
> nodes and basically you start the processes through spark-submit.
>
> In Stand-alone mode, a simple cluster manager included with Spark. It
> does the management of resources so it is not clear to me what you are
> referring as worker manager here?
>
> This is my take from your model.
>  The application will go and grab all the cores in the cluster.
> You only have one worker that lives within the driver JVM process.
> The Driver node runs on the same host that the cluster manager is running.
> The Driver requests the Cluster Manager for resources to run tasks. In this
> case there is only one executor for the Driver? The Executor runs tasks for
> the Driver.
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 20:37, Mathieu Longtin  wrote:
>
>> No master and no node manager, just the processes that do actual work.
>>
>> We use the "stand alone" version because we have a shared file system and
>> a way of allocating computing resources already (Univa Grid Engine). If an
>> executor were to die, we have other ways of restarting it, we don't need
>> the worker manager to deal with it.
>>
>> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Hi Mathieu
>>>
>>> What does this approach provide that the norm lacks?
>>>
>>> So basically each node has its master in this model.
>>>
>>> Are these supposed to be individual stand alone servers?
>>>
>>>
>>> Thanks
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 19 May 2016 at 18:45, Mathieu Longtin  wrote:
>>>
 First a bit of context:
 We use Spark on a platform where each user start workers as needed.
 This has the advantage that all permission management is handled by the OS,
 so the users can only read files they have permission to.

 To do this, we have some utility that does the following:
 - start a master
 - start worker managers on a number of servers
 - "submit" the Spark driver program
 - the driver then talks to the master, tell it how many executors it
 needs
 - the master tell the worker nodes to start executors and talk to the
 driver
 - the executors are started

 From here on, the master doesn't do much, neither do the process
 manager on the worker nodes.

 What I would like to do is simplify this to:
 - Start the driver program
 - Start executors on a number of servers, telling them where to find
 the driver
 - The executors connect directly to the driver

 Is there a way I could do this without the master and worker managers?

 Thanks!


 --
 Mathieu Longtin
 1-514-803-8977

>>>
>>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
> --
Mathieu Longtin
1-514-803-8977


Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
ok so you are using some form of NFS mounted file system shared among the
nodes and basically you start the processes through spark-submit.

In stand-alone mode, a simple cluster manager included with Spark does the
management of resources, so it is not clear to me what you are referring to
as the worker manager here.

This is my take from your model.
 The application will go and grab all the cores in the cluster.
You only have one worker that lives within the driver JVM process.
The Driver node runs on the same host that the cluster manager is running.
The Driver requests the Cluster Manager for resources to run tasks. In this
case there is only one executor for the Driver? The Executor runs tasks for
the Driver.


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 19 May 2016 at 20:37, Mathieu Longtin  wrote:

> No master and no node manager, just the processes that do actual work.
>
> We use the "stand alone" version because we have a shared file system and
> a way of allocating computing resources already (Univa Grid Engine). If an
> executor were to die, we have other ways of restarting it, we don't need
> the worker manager to deal with it.
>
> On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh 
> wrote:
>
>> Hi Mathieu
>>
>> What does this approach provide that the norm lacks?
>>
>> So basically each node has its master in this model.
>>
>> Are these supposed to be individual stand alone servers?
>>
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 19 May 2016 at 18:45, Mathieu Longtin  wrote:
>>
>>> First a bit of context:
>>> We use Spark on a platform where each user starts workers as needed. This
>>> has the advantage that all permission management is handled by the OS, so
>>> the users can only read files they have permission to.
>>>
>>> To do this, we have some utility that does the following:
>>> - start a master
>>> - start worker managers on a number of servers
>>> - "submit" the Spark driver program
>>> - the driver then talks to the master, tells it how many executors it
>>> needs
>>> - the master tells the worker nodes to start executors and talk to the
>>> driver
>>> - the executors are started
>>>
>>> From here on, the master doesn't do much, neither does the process manager
>>> on the worker nodes.
>>>
>>> What I would like to do is simplify this to:
>>> - Start the driver program
>>> - Start executors on a number of servers, telling them where to find the
>>> driver
>>> - The executors connect directly to the driver
>>>
>>> Is there a way I could do this without the master and worker managers?
>>>
>>> Thanks!
>>>
>>>
>>> --
>>> Mathieu Longtin
>>> 1-514-803-8977
>>>
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>


Re: Starting executor without a master

2016-05-19 Thread Mathieu Longtin
No master and no node manager, just the processes that do actual work.

We use the "stand alone" version because we have a shared file system and a
way of allocating computing resources already (Univa Grid Engine). If an
executor were to die, we have other ways of restarting it, we don't need
the worker manager to deal with it.

On Thu, May 19, 2016 at 3:16 PM Mich Talebzadeh 
wrote:

> Hi Mathieu
>
> What does this approach provide that the norm lacks?
>
> So basically each node has its master in this model.
>
> Are these supposed to be individual stand alone servers?
>
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 18:45, Mathieu Longtin  wrote:
>
>> First a bit of context:
>> We use Spark on a platform where each user starts workers as needed. This
>> has the advantage that all permission management is handled by the OS, so
>> the users can only read files they have permission to.
>>
>> To do this, we have some utility that does the following:
>> - start a master
>> - start worker managers on a number of servers
>> - "submit" the Spark driver program
>> - the driver then talks to the master, tells it how many executors it needs
>> - the master tells the worker nodes to start executors and talk to the
>> driver
>> - the executors are started
>>
>> From here on, the master doesn't do much, neither does the process manager
>> on the worker nodes.
>>
>> What I would like to do is simplify this to:
>> - Start the driver program
>> - Start executors on a number of servers, telling them where to find the
>> driver
>> - The executors connect directly to the driver
>>
>> Is there a way I could do this without the master and worker managers?
>>
>> Thanks!
>>
>>
>> --
>> Mathieu Longtin
>> 1-514-803-8977
>>
>
> --
Mathieu Longtin
1-514-803-8977


Re: Starting executor without a master

2016-05-19 Thread Mich Talebzadeh
Hi Mathieu

What does this approach provide that the norm lacks?

So basically each node has its master in this model.

Are these supposed to be individual stand alone servers?


Thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 19 May 2016 at 18:45, Mathieu Longtin  wrote:

> First a bit of context:
> We use Spark on a platform where each user starts workers as needed. This
> has the advantage that all permission management is handled by the OS, so
> the users can only read files they have permission to.
>
> To do this, we have some utility that does the following:
> - start a master
> - start worker managers on a number of servers
> - "submit" the Spark driver program
> - the driver then talks to the master, tells it how many executors it needs
> - the master tells the worker nodes to start executors and talk to the
> driver
> - the executors are started
>
> From here on, the master doesn't do much, neither does the process manager
> on the worker nodes.
>
> What I would like to do is simplify this to:
> - Start the driver program
> - Start executors on a number of servers, telling them where to find the
> driver
> - The executors connect directly to the driver
>
> Is there a way I could do this without the master and worker managers?
>
> Thanks!
>
>
> --
> Mathieu Longtin
> 1-514-803-8977
>


Re: Hive 2 database Entity-Relationship Diagram

2016-05-19 Thread Mich Talebzadeh
Thanks

This is the list of tables and views:






Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 19 May 2016 at 19:10, Thejas Nair  wrote:

> The pdf is not upside down for me, in the Chrome browser.
>
>
> However, there seem to be many tables that are not related to (or
> rather not used by) hive, specifically the DMRV_* and DMRS_* ones.
>
>
> Thanks,
>
> Thejas
>
>
> --
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, May 19, 2016 10:53 AM
> *To:* user
> *Cc:* user @spark
>
> *Subject:* Re: Hive 2 database Entity-Relationship Diagram
>
> thanks Dudu for your comments
>
> I will check.
>
> I will realign the overlapping tables
>
> Only some tables have relationships, not all, I am afraid. Most DMRS_%
> tables are standalone.
>
> I can see the PDF as it is. Can you kindly check the top left-hand corner?
> The one below:
>
>
>
>
>
> Do you see this upside down?
>
> Thanks [image: Inline images 1]
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 May 2016 at 18:44, Markovitz, Dudu  wrote:
>
>> Thanks Mich
>>
>>
>>
>> I’m afraid the current format is not completely user friendly.
>>
>> I would suggest dividing the tables into multiple sets by subject / graph
>> connectivity (BTW, it seems odd that most of the tables are disconnected).
>>
>>
>>
>> Also –
>>
>> · HIVEUSER.PARTITION_KEY_VALS is partially covering another table
>>
>> · The PDF is upside-down
>>
>>
>>
>> Dudu
>>
>>
>>
>> *From:* Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
>> *Sent:* Thursday, May 19, 2016 8:04 PM
>> *To:* user ; user @spark 
>> *Subject:* Re: Hive 2 database Entity-Relationship Diagram
>>
>>
>>
>> Attachment
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn  
>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>> On 19 May 2016 at 18:02, Mich Talebzadeh 
>> wrote:
>>
>> Hi All,
>>
>>
>>
>> I use Hive 2 with metastore created for Oracle Database with
>> hive-txn-schema-2.0.0.oracle.sql.
>>
>>
>>
>> It already includes concurrency stuff added into metastore
>>
>>
>>
>> The RDBMS is Oracle Database 12c Enterprise Edition Release 12.1.0.2.0.
>>
>>
>>
>>  I created an Entity-Relationship (ER) diagram from the physical model.
>> There are 194 tables, 127 views and 38 relationships. The relationship
>> notation is Bachman
>> 
>>
>>
>>
>> Fairly big diagram in PDF format. However, you can zoom into it.
>>
>>
>>
>>
>>
>> Please have a look and send me your comments; if it is useful we
>> can load it into the wiki.
>>
>>
>>
>>
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn  
>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>
>
select table_name from user_tables order by 1;

TABLE_NAME

AUX_TABLE
BUCKETING_COLS
CDS
COLUMNS_V2
COMPACTION_QUEUE
COMPLETED_COMPACTIONS
COMPLETED_TXN_COMPONENTS
DATABASE_PARAMS
DBS
DB_PRIVS
DELEGATION_TOKENS
DMRS_ADDITIONAL_CT_OBJECTS
DMRS_AGGR_FUNC_DIMENSIONS
DMRS_AGGR_FUNC_LEVELS
DMRS_ATTRIBUTES
DMRS_AVT
DMRS_BUSINESS_INFO
DMRS_CHANGE_REQUESTS
DMRS_CHANGE_REQUEST_ELEMENTS
DMRS_CHECK_CONSTRAINTS
DMRS_CLASSIFICATION_TYPES
DMRS_COLLECTION_TYPES
DMRS_COLUMNS
DMRS_COLUMN_GROUPS
DMRS_COLUMN_UI
DMRS_CONSTR_FK_COLUMNS
DMRS_CONSTR_INDEX_COLUMNS
DMRS_CONTACTS
DMRS_CONTACT_EMAILS
DMRS_CONTACT_LOCATIONS
DMRS_CONTACT_RES_LOCATORS
DMRS_CONTACT_TELEPHONES
DMRS_CUBES
DMRS_CUBE_DIMENSIONS
DMRS_DATA_FLOW_DIAGRAM_INFOS
DMRS_DESIGNS
DMRS_DIAGRAMS
DMRS_DIAGRAM_ELEMENTS
DMRS_DIMENSIONS
DMRS_DIMENSION_CALC_ATTRS
DMRS_DIMENSION_LEVELS
DMRS_DISTINCT_TYPES
DMRS_DOCUMENTS
DMRS_DOCUMENT_ELEMENTS
DMRS_DOMAINS
DMRS_DOMAIN_AVT
DMRS_DOMAIN_CHECK_CONSTRAINTS
DMRS_DOMAIN_VALUE_RANGES
DMRS_DYNAMIC_PROPERTIES
DMRS_EMAILS
DMRS_ENTITIES
DMRS_ENTITYVIEWS
DMRS_ENTITY_ARCS
DMRS_EVENTS
DMRS_EXISTENCE_DEP
DMRS_EXISTENCE_DEP_COLUMNS
DMRS_EXTERNAL_AGENTS
DMRS_EXTERNAL_DATAS
DMRS_EXT_AGENT_EXT_DATAS
DMRS_EXT_AGENT_FLOWS
DMRS_FACT_ENTITIES

Re: Latency experiment without losing executors

2016-05-19 Thread Ted Yu
16/05/19 15:51:39 WARN CoarseGrainedExecutorBackend: An unknown
(ip-10-171-80-97.ec2.internal:44765) driver disconnected.
16/05/19 15:51:42 ERROR TransportClient: Failed to send RPC
5466711974642652953 to ip-10-171-80-97.ec2.internal/10.171.80.97:44765:
java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException

Can you check the log for ip-10-171-80-97.ec2.internal to see if there was
some clue ?

Cheers

On Thu, May 19, 2016 at 9:24 AM, Geet Kumar  wrote:

> Ah, it seems the code did not show up in the email. Here is a link to the
> original post:
> http://apache-spark-user-list.1001560.n3.nabble.com/Latency-experiment-without-losing-executors-td26981.html
>
> Also, attached is the executor logs.​
>  spark-logging.log
> 
> ​
>
> Geet Kumar
> DataSys Laboratory, CS/IIT
> Linguistic Cognition Laboratory, CS/IIT
> Department of Computer Science, Illinois Institute of Technology (IIT)
> Email: gkum...@hawk.iit.edu
>
>
> On Thu, May 19, 2016 at 3:23 AM, Ted Yu  wrote:
>
>> I didn't see the code snippet. Were you using picture(s) ?
>>
>> Please pastebin the code.
>>
>> It would be better if you pastebin executor log for the killed executor.
>>
>> Thanks
>>
>> On Wed, May 18, 2016 at 9:41 PM, gkumar7  wrote:
>>
>>> I would like to test the latency (tasks/s) perceived in a simple
>>> application
>>> on Apache Spark.
>>>
>>> The idea: The workers will generate random data to be placed in a list.
>>> The
>>> final action (count) will count the total number of data points
>>> generated.
>>>
>>> Below, the numberOfPartitions is equal to the number of datapoints which
>>> need to be generated (datapoints are integers).
>>>
>>> Although the code works as expected, a total of 119 spark executors were
>>> killed while running with 64 slaves. I feel this is because, since Spark
>>> assigns executors to each node, the number of partitions each node is
>>> assigned to compute may require more memory than is available on that node.
>>> This causes these executors to be killed and therefore, the latency
>>> measurement I would like to analyze is inaccurate.
>>>
>>> Any assistance with code cleanup below or how to fix the above issue to
>>> decrease the number of killed executors, would be much appreciated.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Latency-experiment-without-losing-executors-tp26981.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Hive 2 database Entity-Relationship Diagram

2016-05-19 Thread Mich Talebzadeh
Hi All,

I use Hive 2 with metastore created for Oracle Database with
hive-txn-schema-2.0.0.oracle.sql.

It already includes concurrency stuff added into metastore

The RDBMS is Oracle Database 12c Enterprise Edition Release 12.1.0.2.0.

 I created an Entity-Relationship (ER) diagram from the physical model.
There are 194 tables, 127 views and 38 relationships. The relationship
notation is Bachman


Fairly big diagram in PDF format. However, you can zoom into it.


Please have a look and send me your comments; if it is useful we can
load it into the wiki.


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: Latency experiment without losing executors

2016-05-19 Thread Geet Kumar
Ah, it seems the code did not show up in the email. Here is a link to the
original post:
http://apache-spark-user-list.1001560.n3.nabble.com/Latency-experiment-without-losing-executors-td26981.html

Also, attached is the executor logs.​
 spark-logging.log

​

Geet Kumar
DataSys Laboratory, CS/IIT
Linguistic Cognition Laboratory, CS/IIT
Department of Computer Science, Illinois Institute of Technology (IIT)
Email: gkum...@hawk.iit.edu


On Thu, May 19, 2016 at 3:23 AM, Ted Yu  wrote:

> I didn't see the code snippet. Were you using picture(s) ?
>
> Please pastebin the code.
>
> It would be better if you pastebin executor log for the killed executor.
>
> Thanks
>
> On Wed, May 18, 2016 at 9:41 PM, gkumar7  wrote:
>
>> I would like to test the latency (tasks/s) perceived in a simple
>> application
>> on Apache Spark.
>>
>> The idea: The workers will generate random data to be placed in a list.
>> The
>> final action (count) will count the total number of data points generated.
>>
>> Below, the numberOfPartitions is equal to the number of datapoints which
>> need to be generated (datapoints are integers).
>>
>> Although the code works as expected, a total of 119 spark executors were
>> killed while running with 64 slaves. I feel this is because, since Spark
>> assigns executors to each node, the number of partitions each node is
>> assigned to compute may require more memory than is available on that node.
>> This causes these executors to be killed and therefore, the latency
>> measurement I would like to analyze is inaccurate.
>>
>> Any assistance with code cleanup below or how to fix the above issue to
>> decrease the number of killed executors, would be much appreciated.
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Latency-experiment-without-losing-executors-tp26981.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: dataframe stat corr for multiple columns

2016-05-19 Thread Xiangrui Meng
This is nice to have. Please create a JIRA for it. Right now, you can merge
all columns into a vector column using RFormula or VectorAssembler, then
convert it into an RDD and call corr from MLlib.
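
For example, a minimal sketch of that approach (the DataFrame `df` and the
column names "c1", "c2", "c3" are placeholders, and it assumes Spark 1.x, where
spark.ml's VectorAssembler outputs mllib vectors):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.stat.Statistics

    // merge the numeric columns into a single vector column
    val assembler = new VectorAssembler()
      .setInputCols(Array("c1", "c2", "c3"))
      .setOutputCol("features")

    // convert to RDD[Vector] and compute the pairwise correlation matrix
    val vectors = assembler.transform(df)
      .select("features")
      .rdd
      .map(_.getAs[Vector]("features"))
    val corrMatrix = Statistics.corr(vectors, "pearson")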

On Tue, May 17, 2016, 7:09 AM Ankur Jain  wrote:

> Hello Team,
>
>
>
> In my current usecase I am loading data from CSV using spark-csv and
> trying to correlate all variables.
>
>
>
> As of now, if we want to correlate 2 columns in a dataframe, *df.stat.corr*
> works great, but if we want to correlate multiple columns this won’t work.
>
> In case of R we can use corrplot and correlate all numeric columns in a
> single line of code. Can you guide me how to achieve the same with
> dataframe or sql?
>
>
>
> There seems a way in spark-mllib
>
> http://spark.apache.org/docs/latest/mllib-statistics.html
>
>
>
>
>
> But it seems that it doesn’t take a dataframe as input…
>
>
>
> Regards,
>
> Ankur
> Information transmitted by this e-mail is proprietary to YASH Technologies
> and/ or its Customers and is intended for use only by the individual or
> entity to which it is addressed, and may contain information that is
> privileged, confidential or exempt from disclosure under applicable law. If
> you are not the intended recipient or it appears that this mail has been
> forwarded to you without proper authority, you are notified that any use or
> dissemination of this information in any manner is strictly prohibited. In
> such cases, please notify us immediately at i...@yash.com and delete this
> mail from your records.
>


Re: Couldn't find leader offsets

2016-05-19 Thread Cody Koeninger
Looks like a networking issue to me.  Make sure you can connect to the
broker on the specified host and port from the spark driver (and the
executors too, for that matter)
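
As a quick sanity check (the host and port are taken from the example below;
adjust them for your broker), something like this, run from the driver node in
spark-shell, will throw right away if nothing is listening on that address:

    import java.net.{InetSocketAddress, Socket}

    // fails fast with an exception if the broker host/port is unreachable
    val sock = new Socket()
    sock.connect(new InetSocketAddress("10.0.7.34", 9092), 5000)
    sock.close()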

On Wed, May 18, 2016 at 4:04 PM, samsayiam  wrote:
> I have seen questions posted about this on SO and on this list but haven't
> seen a response that addresses my issue.  I am trying to create a direct
> stream connection to a kafka topic but it fails with Couldn't find leader
> offsets for Set(...).  If I run a kafka consumer I can read the topic but
> can't do it with spark.  Can someone tell me where I'm going wrong here?
>
> Test topic info:
> vagrant@broker1$ ./bin/kafka-topics.sh --describe --zookeeper 10.30.3.2:2181
> --topic footopic
> Topic:footopic  PartitionCount:1    ReplicationFactor:1 Configs:
> Topic: footopic Partition: 0    Leader: 0   Replicas: 0   Isr: 0
>
> consuming from kafka:
> vagrant@broker1$ bin/kafka-console-consumer.sh --zookeeper 10.30.3.2:2181
> --from-beginning --topic footopic
> this is a test
> and so is this
> goodbye
>
> Attempting from spark:
> spark-submit --class com.foo.Experiment --master local[*] --jars
> /vagrant/spark-streaming-kafka-assembly_2.10-1.6.1.jar
> /vagrant/spark-app-1.0-SNAPSHOT.jar 10.0.7.34:9092
>
> ...
>
> Using kafkaparams: {auto.offset.reset=smallest,
> metadata.broker.list=10.0.7.34:9092}
> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Verifying properties
> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Property
> auto.offset.reset is overridden to smallest
> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Property group.id is
> overridden to
> 16/05/18 20:27:21 INFO utils.VerifiableProperties: Property
> zookeeper.connect is overridden to
> 16/05/18 20:27:21 INFO consumer.SimpleConsumer: Reconnect due to socket
> error: java.nio.channels.ClosedChannelException
> Exception in thread "main" org.apache.spark.SparkException:
> java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([footopic,0])
> ...
>
>
> Any help is appreciated.
>
> Thanks,
> ch.
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Couldn-t-find-leader-offsets-tp26978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Does Structured Streaming support Kafka as data source?

2016-05-19 Thread Cody Koeninger
I went ahead and created

https://issues.apache.org/jira/browse/SPARK-15406

to track this

On Wed, May 18, 2016 at 9:55 PM, Todd  wrote:
> Hi,
> I took a brief look at the Spark code, and it seems that structured streaming
> doesn't support Kafka as a data source yet?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: HBase / Spark Kerberos problem

2016-05-19 Thread Arun Natva
Some of the Hadoop services cannot make use of the ticket obtained by
loginUserFromKeytab.

I was able to get past it using a GSS JAAS configuration, where you can pass
either a keytab file or a ticket cache to the Spark executors that access HBase.
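
For what it is worth, here is a hypothetical sketch of that approach (the
jaas.conf path, the login section name and the keytab/principal values are all
made-up placeholders, and the file has to exist at that path on every node):

    import org.apache.spark.SparkConf

    // contents of a hypothetical /etc/spark/jaas.conf:
    //   Client {
    //     com.sun.security.auth.module.Krb5LoginModule required
    //     useKeyTab=true
    //     keyTab="/etc/security/keytabs/app.keytab"
    //     principal="app@EXAMPLE.COM"
    //     storeKey=true;
    //   };

    // point the executor JVMs at the JAAS file; the driver-side equivalent is
    // usually passed on the spark-submit command line rather than set in code
    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
        "-Djava.security.auth.login.config=/etc/spark/jaas.conf")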

Sent from my iPhone

> On May 19, 2016, at 4:51 AM, Ellis, Tom (Financial Markets IT) 
>  wrote:
> 
> Yeah we ran into this issue. Key part is to have the hbase jars and 
> hbase-site.xml config on the classpath of the spark submitter.
>  
> We did it slightly differently from Y Bodnar, where we set the required jars 
> and config on the env var SPARK_DIST_CLASSPATH in our spark env file (rather 
> than SPARK_CLASSPATH which is deprecated).
>  
> With this and –principal/--keytab, if you turn DEBUG logging for 
> org.apache.spark.deploy.yarn you should see “Added HBase security token to 
> credentials.”
>  
> Otherwise you should at least hopefully see the error where it fails to add 
> the HBase tokens.
>  
> Check out the source of Client [1] and YarnSparkHadoopUtil  [2] – you’ll see 
> how obtainTokenForHBase is being done.
>  
> It’s a bit confusing as to why it says you haven’t kinited even when you do 
> loginUserFromKeytab – I haven’t quite worked through the reason for that yet.
>  
> Cheers,
>  
> Tom Ellis
> telli...@gmail.com
>  
> [1] 
> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
> [2] 
> https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala
>  
>  
> From: John Trengrove [mailto:john.trengr...@servian.com.au] 
> Sent: 19 May 2016 08:09
> To: philipp.meyerhoe...@thomsonreuters.com
> Cc: user
> Subject: Re: HBase / Spark Kerberos problem
>  
> -- This email has reached the Bank via an external source -- 
>  
> Have you had a look at this issue?
>  
> https://issues.apache.org/jira/browse/SPARK-12279 
>  
> There is a comment by Y Bodnar on how they successfully got Kerberos and 
> HBase working.
>  
> 2016-05-18 18:13 GMT+10:00 :
> Hi all,
> 
> I have been puzzling over a Kerberos problem for a while now and wondered if 
> anyone can help.
> 
> For spark-submit, I specify --keytab x --principal y, which creates my 
> SparkContext fine.
> Connections to Zookeeper Quorum to find the HBase master work well too.
> But when it comes to a .count() action on the RDD, I am always presented with 
> the stack trace at the end of this mail.
> 
> We are using CDH5.5.2 (spark 1.5.0), and 
> com.cloudera.spark.hbase.HBaseContext is a wrapper around 
> TableInputFormat/hadoopRDD (see 
> https://github.com/cloudera-labs/SparkOnHBase), as you can see in the stack 
> trace.
> 
> Am I doing something obvious wrong here?
> A similar flow, inside test code, works well, only going via spark-submit 
> exposes this issue.
> 
> Code snippet (I have tried using the commented-out lines in various 
> combinations, without success):
> 
>val conf = new SparkConf().
>   set("spark.shuffle.consolidateFiles", "true").
>   set("spark.kryo.registrationRequired", "false").
>   set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
>   set("spark.kryoserializer.buffer", "30m")
> val sc = new SparkContext(conf)
> val cfg = sc.hadoopConfiguration
> //cfg.addResource(new 
> org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))
> //
> UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS)
> //cfg.set("hbase.security.authentication", "kerberos")
> val hc = new HBaseContext(sc, cfg)
> val scan = new Scan
> scan.setTimeRange(startMillis, endMillis)
> val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch)
> val cnt = matchesInRange.count()
> log.info(s"matches in range $cnt")
> 
> Stack trace / log:
> 
> 16/05/17 17:04:47 INFO SparkContext: Starting job: count at Analysis.scala:93
> 16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at Analysis.scala:93) 
> with 1 output partitions
> 16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at 
> Analysis.scala:93)
> 16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List()
> 16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List()
> 16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0 
> (MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no missing 
> parents
> 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with 
> curMem=428022, maxMem=244187136
> 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in 
> memory (estimated size 3.2 KB, free 232.5 MB)
> 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with 
> curMem=431270, maxMem=244187136
> 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes 
> in memory (estimated size 2022.0 B, free 232.5 MB)
> 16/05/17 17:04:47 INFO 

Re: Spark Streaming Application run on yarn-cluster mode

2016-05-19 Thread Ted Yu
Yes.

See https://spark.apache.org/docs/latest/streaming-programming-guide.html

On Thu, May 19, 2016 at 7:24 AM,  wrote:

> Hi Friends,
>
> Will a Spark Streaming job run in yarn-cluster mode?
>
> Thanks
> Raj
>
>
> Sent from Yahoo Mail. Get the app 
>


Spark Streaming Application run on yarn-cluster mode

2016-05-19 Thread spark.raj
Hi Friends,

Will a Spark Streaming job run in yarn-cluster mode?

Thanks
Raj

Sent from Yahoo Mail. Get the app

RE: HBase / Spark Kerberos problem

2016-05-19 Thread philipp.meyerhoefer
Thanks Tom & John!

modifying spark-env.sh did the trick - my last line in the file is now:

export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt"):`hbase 
classpath`:/etc/hbase/conf:/etc/hbase/conf/hbase-site.xml

Now o.a.s.d.y.Client logs “Added HBase security token to credentials” and the 
.count() on my HBase RDD works fine.

From: Ellis, Tom (Financial Markets IT) [mailto:tom.el...@lloydsbanking.com] 
Sent: 19 May 2016 09:51
To: 'John Trengrove'; Meyerhoefer, Philipp (TR Technology & Ops)
Cc: user
Subject: RE: HBase / Spark Kerberos problem

Yeah we ran into this issue. Key part is to have the hbase jars and 
hbase-site.xml config on the classpath of the spark submitter.

We did it slightly differently from Y Bodnar, where we set the required jars 
and config on the env var SPARK_DIST_CLASSPATH in our spark env file (rather 
than SPARK_CLASSPATH which is deprecated).

With this and –principal/--keytab, if you turn DEBUG logging for 
org.apache.spark.deploy.yarn you should see “Added HBase security token to 
credentials.”

Otherwise you should at least hopefully see the error where it fails to add the 
HBase tokens.

Check out the source of Client [1] and YarnSparkHadoopUtil  [2] – you’ll see 
how obtainTokenForHBase is being done.

It’s a bit confusing as to why it says you haven’t kinited even when you do 
loginUserFromKeytab – I haven’t quite worked through the reason for that yet.

Cheers,

Tom Ellis
telli...@gmail.com

[1] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
[2] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala


From: John Trengrove [mailto:john.trengr...@servian.com.au] 
Sent: 19 May 2016 08:09
To: philipp.meyerhoe...@thomsonreuters.com
Cc: user
Subject: Re: HBase / Spark Kerberos problem

-- This email has reached the Bank via an external source -- 
  
Have you had a look at this issue?

https://issues.apache.org/jira/browse/SPARK-12279 

There is a comment by Y Bodnar on how they successfully got Kerberos and HBase 
working.

2016-05-18 18:13 GMT+10:00 :
Hi all,

I have been puzzling over a Kerberos problem for a while now and wondered if 
anyone can help.

For spark-submit, I specify --keytab x --principal y, which creates my 
SparkContext fine.
Connections to Zookeeper Quorum to find the HBase master work well too.
But when it comes to a .count() action on the RDD, I am always presented with 
the stack trace at the end of this mail.

We are using CDH5.5.2 (spark 1.5.0), and com.cloudera.spark.hbase.HBaseContext 
is a wrapper around TableInputFormat/hadoopRDD (see 
https://github.com/cloudera-labs/SparkOnHBase), as you can see in the stack 
trace.

Am I doing something obvious wrong here?
A similar flow, inside test code, works well, only going via spark-submit 
exposes this issue.

Code snippet (I have tried using the commented-out lines in various 
combinations, without success):

   val conf = new SparkConf().
      set("spark.shuffle.consolidateFiles", "true").
      set("spark.kryo.registrationRequired", "false").
      set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      set("spark.kryoserializer.buffer", "30m")
    val sc = new SparkContext(conf)
    val cfg = sc.hadoopConfiguration
//    cfg.addResource(new 
org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))
//    
UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS)
//    cfg.set("hbase.security.authentication", "kerberos")
    val hc = new HBaseContext(sc, cfg)
    val scan = new Scan
    scan.setTimeRange(startMillis, endMillis)
    val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch)
    val cnt = matchesInRange.count()
    log.info(s"matches in range $cnt")

Stack trace / log:

16/05/17 17:04:47 INFO SparkContext: Starting job: count at Analysis.scala:93
16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at Analysis.scala:93) 
with 1 output partitions
16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at 
Analysis.scala:93)
16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List()
16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List()
16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no missing 
parents
16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with 
curMem=428022, maxMem=244187136
16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in 
memory (estimated size 3.2 KB, free 232.5 MB)
16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with 
curMem=431270, maxMem=244187136
16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in 
memory (estimated size 2022.0 B, free 232.5 MB)
16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory 

RE: HBase / Spark Kerberos problem

2016-05-19 Thread Ellis, Tom (Financial Markets IT)
Yeah we ran into this issue. Key part is to have the hbase jars and 
hbase-site.xml config on the classpath of the spark submitter.

We did it slightly differently from Y Bodnar, where we set the required jars 
and config on the env var SPARK_DIST_CLASSPATH in our spark env file (rather 
than SPARK_CLASSPATH which is deprecated).

With this and –principal/--keytab, if you turn DEBUG logging for 
org.apache.spark.deploy.yarn you should see “Added HBase security token to 
credentials.”

Otherwise you should at least hopefully see the error where it fails to add the 
HBase tokens.

Check out the source of Client [1] and YarnSparkHadoopUtil  [2] – you’ll see 
how obtainTokenForHBase is being done.

It’s a bit confusing as to why it says you haven’t kinited even when you do 
loginUserFromKeytab – I haven’t quite worked through the reason for that yet.

Cheers,

Tom Ellis
telli...@gmail.com

[1] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
[2] 
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala


From: John Trengrove [mailto:john.trengr...@servian.com.au]
Sent: 19 May 2016 08:09
To: philipp.meyerhoe...@thomsonreuters.com
Cc: user
Subject: Re: HBase / Spark Kerberos problem

-- This email has reached the Bank via an external source --

Have you had a look at this issue?

https://issues.apache.org/jira/browse/SPARK-12279

There is a comment by Y Bodnar on how they successfully got Kerberos and HBase 
working.

2016-05-18 18:13 GMT+10:00 
>:
Hi all,

I have been puzzling over a Kerberos problem for a while now and wondered if 
anyone can help.

For spark-submit, I specify --keytab x --principal y, which creates my 
SparkContext fine.
Connections to Zookeeper Quorum to find the HBase master work well too.
But when it comes to a .count() action on the RDD, I am always presented with 
the stack trace at the end of this mail.

We are using CDH5.5.2 (spark 1.5.0), and com.cloudera.spark.hbase.HBaseContext 
is a wrapper around TableInputFormat/hadoopRDD (see 
https://github.com/cloudera-labs/SparkOnHBase), as you can see in the stack 
trace.

Am I doing something obvious wrong here?
A similar flow, inside test code, works well, only going via spark-submit 
exposes this issue.

Code snippet (I have tried using the commented-out lines in various 
combinations, without success):

   val conf = new SparkConf().
  set("spark.shuffle.consolidateFiles", "true").
  set("spark.kryo.registrationRequired", "false").
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  set("spark.kryoserializer.buffer", "30m")
val sc = new SparkContext(conf)
val cfg = sc.hadoopConfiguration
//cfg.addResource(new 
org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))
//
UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS)
//cfg.set("hbase.security.authentication", "kerberos")
val hc = new HBaseContext(sc, cfg)
val scan = new Scan
scan.setTimeRange(startMillis, endMillis)
val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch)
val cnt = matchesInRange.count()
log.info(s"matches in range $cnt")

Stack trace / log:

16/05/17 17:04:47 INFO SparkContext: Starting job: count at Analysis.scala:93
16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at Analysis.scala:93) 
with 1 output partitions
16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at 
Analysis.scala:93)
16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List()
16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List()
16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no missing 
parents
16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with 
curMem=428022, maxMem=244187136
16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in 
memory (estimated size 3.2 KB, free 232.5 MB)
16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with 
curMem=431270, maxMem=244187136
16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in 
memory (estimated size 2022.0 B, free 232.5 MB)
16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 
10.6.164.40:33563 (size: 2022.0 B, free: 232.8 MB)
16/05/17 17:04:47 INFO SparkContext: Created broadcast 3 from broadcast at 
DAGScheduler.scala:861
16/05/17 17:04:47 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at map at HBaseContext.scala:580)
16/05/17 17:04:47 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
16/05/17 17:04:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 

Filter out the elements from xml file in Spark

2016-05-19 Thread Yogesh Vyas
Hi,
I have XML files which I am reading through textFileStream, and I am then
filtering out the required elements using traditional conditions and
loops. I would like to know if there are any specific packages or
functions provided in Spark to perform operations on an RDD of XML?

Regards,
Yogesh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Latency experiment without losing executors

2016-05-19 Thread Ted Yu
I didn't see the code snippet. Were you using picture(s) ?

Please pastebin the code.

It would be better if you pastebin executor log for the killed executor.

Thanks

On Wed, May 18, 2016 at 9:41 PM, gkumar7  wrote:

> I would like to test the latency (tasks/s) perceived in a simple
> application
> on Apache Spark.
>
> The idea: The workers will generate random data to be placed in a list. The
> final action (count) will count the total number of data points generated.
>
> Below, the numberOfPartitions is equal to the number of datapoints which
> need to be generated (datapoints are integers).
>
> Although the code works as expected, a total of 119 spark executors were
> killed while running with 64 slaves. I feel this is because, since Spark
> assigns executors to each node, the number of partitions each node is
> assigned to compute may require more memory than is available on that node.
> This causes these executors to be killed and therefore, the latency
> measurement I would like to analyze is inaccurate.
>
> Any assistance with code cleanup below or how to fix the above issue to
> decrease the number of killed executors, would be much appreciated.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Latency-experiment-without-losing-executors-tp26981.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: HBase / Spark Kerberos problem

2016-05-19 Thread John Trengrove
Have you had a look at this issue?

https://issues.apache.org/jira/browse/SPARK-12279

There is a comment by Y Bodnar on how they successfully got Kerberos and
HBase working.

2016-05-18 18:13 GMT+10:00 :

> Hi all,
>
> I have been puzzling over a Kerberos problem for a while now and wondered
> if anyone can help.
>
> For spark-submit, I specify --keytab x --principal y, which creates my
> SparkContext fine.
> Connections to Zookeeper Quorum to find the HBase master work well too.
> But when it comes to a .count() action on the RDD, I am always presented
> with the stack trace at the end of this mail.
>
> We are using CDH5.5.2 (spark 1.5.0), and
> com.cloudera.spark.hbase.HBaseContext is a wrapper around
> TableInputFormat/hadoopRDD (see
> https://github.com/cloudera-labs/SparkOnHBase), as you can see in the
> stack trace.
>
> Am I doing something obvious wrong here?
> A similar flow, inside test code, works well, only going via spark-submit
> exposes this issue.
>
> Code snippet (I have tried using the commented-out lines in various
> combinations, without success):
>
>val conf = new SparkConf().
>   set("spark.shuffle.consolidateFiles", "true").
>   set("spark.kryo.registrationRequired", "false").
>   set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer").
>   set("spark.kryoserializer.buffer", "30m")
> val sc = new SparkContext(conf)
> val cfg = sc.hadoopConfiguration
> //cfg.addResource(new
> org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))
> //
> UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS)
> //cfg.set("hbase.security.authentication", "kerberos")
> val hc = new HBaseContext(sc, cfg)
> val scan = new Scan
> scan.setTimeRange(startMillis, endMillis)
> val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch)
> val cnt = matchesInRange.count()
> log.info(s"matches in range $cnt")
>
> Stack trace / log:
>
> 16/05/17 17:04:47 INFO SparkContext: Starting job: count at
> Analysis.scala:93
> 16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at
> Analysis.scala:93) with 1 output partitions
> 16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at
> Analysis.scala:93)
> 16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List()
> 16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List()
> 16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0
> (MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no
> missing parents
> 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with
> curMem=428022, maxMem=244187136
> 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in
> memory (estimated size 3.2 KB, free 232.5 MB)
> 16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with
> curMem=431270, maxMem=244187136
> 16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as
> bytes in memory (estimated size 2022.0 B, free 232.5 MB)
> 16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in
> memory on 10.6.164.40:33563 (size: 2022.0 B, free: 232.8 MB)
> 16/05/17 17:04:47 INFO SparkContext: Created broadcast 3 from broadcast at
> DAGScheduler.scala:861
> 16/05/17 17:04:47 INFO DAGScheduler: Submitting 1 missing tasks from
> ResultStage 0 (MapPartitionsRDD[1] at map at HBaseContext.scala:580)
> 16/05/17 17:04:47 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
> 16/05/17 17:04:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, hpg-dev-vm, partition 0,PROCESS_LOCAL, 2208 bytes)
> 16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in
> memory on hpg-dev-vm:52698 (size: 2022.0 B, free: 388.4 MB)
> 16/05/17 17:04:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in
> memory on hpg-dev-vm:52698 (size: 26.0 KB, free: 388.4 MB)
> 16/05/17 17:04:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> hpg-dev-vm): org.apache.hadoop.hbase.client.RetriesExhaustedException:
> Can't get the location
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:308)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:155)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:63)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:289)
> at
> org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:161)
> at
> org.apache.hadoop.hbase.client.ClientScanner.(ClientScanner.java:156)
> at
> 

HBase / Spark Kerberos problem

2016-05-19 Thread philipp.meyerhoefer
Hi all,

I have been puzzling over a Kerberos problem for a while now and wondered if 
anyone can help.

For spark-submit, I specify --master yarn-client --keytab x --principal y, 
which creates my SparkContext fine.
Connections to Zookeeper Quorum to find the HBase master work well too.
But when it comes to a .count() action on the RDD, I am always presented with 
the stack trace at the end of this mail.

We are using CDH5.5.2 (spark 1.5.0), and com.cloudera.spark.hbase.HBaseContext 
is a wrapper around TableInputFormat/hadoopRDD (see 
https://github.com/cloudera-labs/SparkOnHBase), as you can see in the stack 
trace.

Am I doing something obvious wrong here?
A similar flow, inside test code, works well, only going via spark-submit 
exposes this issue.

Code snippet (I have tried using the commented-out lines in various 
combinations, without success):

   val conf = new SparkConf().
  set("spark.shuffle.consolidateFiles", "true").
  set("spark.kryo.registrationRequired", "false").
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  set("spark.kryoserializer.buffer", "30m")
val sc = new SparkContext(conf)
val cfg = sc.hadoopConfiguration
//cfg.addResource(new 
org.apache.hadoop.fs.Path("/etc/hbase/conf/hbase-site.xml"))
//
UserGroupInformation.getCurrentUser.setAuthenticationMethod(UserGroupInformation.AuthenticationMethod.KERBEROS)
//cfg.set("hbase.security.authentication", "kerberos")
val hc = new HBaseContext(sc, cfg)
val scan = new Scan
scan.setTimeRange(startMillis, endMillis)
val matchesInRange = hc.hbaseRDD(MY_TABLE, scan, resultToMatch)
val cnt = matchesInRange.count()
log.info(s"matches in range $cnt")

Stack trace / log:

16/05/17 17:04:47 INFO SparkContext: Starting job: count at Analysis.scala:93
16/05/17 17:04:47 INFO DAGScheduler: Got job 0 (count at Analysis.scala:93) 
with 1 output partitions
16/05/17 17:04:47 INFO DAGScheduler: Final stage: ResultStage 0(count at 
Analysis.scala:93)
16/05/17 17:04:47 INFO DAGScheduler: Parents of final stage: List()
16/05/17 17:04:47 INFO DAGScheduler: Missing parents: List()
16/05/17 17:04:47 INFO DAGScheduler: Submitting ResultStage 0 
(MapPartitionsRDD[1] at map at HBaseContext.scala:580), which has no missing 
parents
16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(3248) called with 
curMem=428022, maxMem=244187136
16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3 stored as values in 
memory (estimated size 3.2 KB, free 232.5 MB)
16/05/17 17:04:47 INFO MemoryStore: ensureFreeSpace(2022) called with 
curMem=431270, maxMem=244187136
16/05/17 17:04:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in 
memory (estimated size 2022.0 B, free 232.5 MB)
16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 
10.6.164.40:33563 (size: 2022.0 B, free: 232.8 MB)
16/05/17 17:04:47 INFO SparkContext: Created broadcast 3 from broadcast at 
DAGScheduler.scala:861
16/05/17 17:04:47 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 0 (MapPartitionsRDD[1] at map at HBaseContext.scala:580)
16/05/17 17:04:47 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
16/05/17 17:04:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 
hpg-dev-vm, partition 0,PROCESS_LOCAL, 2208 bytes)
16/05/17 17:04:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 
hpg-dev-vm:52698 (size: 2022.0 B, free: 388.4 MB)
16/05/17 17:04:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 
hpg-dev-vm:52698 (size: 26.0 KB, free: 388.4 MB)
16/05/17 17:04:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
hpg-dev-vm): org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't 
get the location
at 
org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:308)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:155)
at 
org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:63)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at 
org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)
at 
org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:289)
at 
org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:161)
at 
org.apache.hadoop.hbase.client.ClientScanner.(ClientScanner.java:156)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:888)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.restart(TableRecordReaderImpl.java:90)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.initialize(TableRecordReaderImpl.java:167)
at 
org.apache.hadoop.hbase.mapreduce.TableRecordReader.initialize(TableRecordReader.java:138)
  

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
1. Create a temp dir on HDFS, say “/tmp”.
2. Write a script to create in the temp dir one file for each tar file. Each
file has only one line: the HDFS path of that tar file.

3. Write a spark application. It is like:
  val rdd = sc.textFile ()
  rdd.map { line =>
   construct an untar command using the path information in “line” and 
launches the command
  }
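
A slightly more concrete sketch of that idea (the list-file path
"/tmp/tar-list", the use of foreach as the action, and shelling out to the
hdfs and tar command-line tools on each worker are assumptions, not part of
the original suggestion):

    import scala.sys.process._

    // one HDFS path per line, as prepared in step 2
    val rdd = sc.textFile("/tmp/tar-list")

    rdd.foreach { tarPath =>
      val name = tarPath.split("/").last
      // pull the tar file into the executor's working directory and unpack it
      Seq("hdfs", "dfs", "-get", tarPath, name).!
      Seq("tar", "-xf", name).!
    }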

> On May 19, 2016, at 14:42, ayan guha  wrote:
> 
> Hi
> 
> I have a few tar files in HDFS in a single folder. Each tar file has multiple files
> in it.
> 
> tar1:
>   - f1.txt
>   - f2.txt
> tar2:
>   - f1.txt
>   - f2.txt
> 
> (each tar file will have exact same number of files, same name)
> 
> I am trying to find a way (spark or pig) to extract them to their own 
> folders. 
> 
> f1
>   - tar1_f1.txt
>   - tar2_f1.txt
> f2:
>- tar1_f2.txt
>- tar1_f2.txt
> 
> Any help? 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Tar File: On Spark

2016-05-19 Thread ayan guha
Hi

I have a few tar files in HDFS in a single folder. Each tar file has multiple
files in it.

tar1:
  - f1.txt
  - f2.txt
tar2:
  - f1.txt
  - f2.txt

(each tar file will have exact same number of files, same name)

I am trying to find a way (spark or pig) to extract them to their own
folders.

f1
  - tar1_f1.txt
  - tar2_f1.txt
f2:
   - tar1_f2.txt
   - tar1_f2.txt

Any help?



-- 
Best Regards,
Ayan Guha


Any way to pass custom hadoop conf to through spark thrift server ?

2016-05-19 Thread Jeff Zhang
I want to pass a custom Hadoop conf to the Spark thrift server so that both
the driver and executor sides can get this conf. I also want this custom
Hadoop conf to be visible only to the job of the user who set it. Is this
possible with the Spark thrift server now? Thanks



-- 
Best Regards

Jeff Zhang