streaming on yarn

2016-06-24 Thread Alex Dzhagriev
Hello,

Can someone please share opinions on the options available for running
Spark Streaming jobs on YARN? The first thing that comes to my mind is to
use Slider, but googling for such experience didn't give me much. From my
experience running the same jobs on Mesos, I have two concerns: automatic
scaling (not blocking resources when there is no data in the stream) and a
UI to manage the running jobs.
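
For reference, here is roughly how I would switch dynamic allocation on for
a YARN job (the class and jar names below are placeholders); whether this
behaves sensibly for a long-running streaming job is exactly what I am
unsure about:

./bin/spark-submit --master yarn --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --class my.streaming.Driver my-streaming-job.jar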

Thanks, Alex.


--jars for mesos cluster

2016-05-03 Thread Alex Dzhagriev
Hello all,

In the Mesos related spark docs (
http://spark.apache.org/docs/1.6.0/running-on-mesos.html#cluster-mode) I
found this statement:

> Note that jars or python files that are passed to spark-submit should be
> URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically
> upload local jars.



However, when I'm trying to submit the job using:

./spark-submit --class MyDriver --master mesos://my-mesos:7077
--deploy-mode cluster --supervise --executor-memory 2G --jars
http://3rd-party.jar http://my.jar

However, I don't see the --jars entries in the command launched on the slave machine:

sh -c './bin/spark-submit --name MyDriver --master mesos://my-mesos:7077
--driver-cores 1.0 --driver-memory 1024M --class MyDriver --executor-memory
2G ./my.jar '

And, of course, I'm running into a ClassNotFoundException, as the 3rd-party
library is not there.
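
In the meantime, a workaround I am considering (not verified; the URLs below
are the same placeholders as above): pass the dependency through the
spark.jars property, or bundle it into an assembly jar so that nothing has
to be distributed separately:

./spark-submit --class MyDriver --master mesos://my-mesos:7077 \
  --deploy-mode cluster --supervise --executor-memory 2G \
  --conf spark.jars=http://3rd-party.jar \
  http://my.jar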

Can someone please help me specify --jars correctly?

Thanks, Alex.


Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Alex Dzhagriev
Hi Vinay,

I believe it's not possible, as the spark-shuffle code has to run as an
auxiliary service inside the same JVM process as the Node Manager. I haven't
heard anything about on-the-fly bytecode loading in the Node Manager.
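
For context, this is roughly how the service is normally wired in, and it
only takes effect after the Node Manager restarts with the YARN shuffle
service jar on its classpath (property values may differ per distribution):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>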

Thanks, Alex.

On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap  wrote:

> Hi all,
>
> I am using *Spark 1.5.1* in *yarn-client* mode along with *CDH 5.5*
>
> As per the documentation to enable Dynamic Allocation of Executors in
> Spark,
> it is required to add the shuffle service jar to YARN Node Manager's
> classpath and restart the YARN Node Manager.
>
> Is there any way to dynamically supply the shuffle service jar
> information from the application itself and avoid disturbing the running
> YARN service?
>
> I tried a couple of options, such as uploading the jar to HDFS and setting
> *yarn.application.classpath*, but they did not work. On container launch for
> the executor it fails to recognize the shuffle service.
>
> Any help would be greatly appreciated.
>
> --
> *Thanks and regards*
> *Vinay Kashyap*
>


Re: Sorting the RDD

2016-03-03 Thread Alex Dzhagriev
Hi Angel,

Your x() function returns Any, so there is no Ordering[Any] in scope, and
it wouldn't make sense to define one: it would be like trying to order plain
Java Objects, which have no fields to compare. So the problem is with your
x() function: make sure it returns something more specific.
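
A quick sketch of the difference (the data here is made up):

// keys with a concrete type sort fine, because Ordering[String] is in scope
val rdd = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
rdd.sortByKey().collect()   // Array((a,1), (b,2), (c,3))

// widening the key to Any reproduces your error at compile time:
// "No implicit Ordering defined for Any"
// rdd.map { case (k, v) => (k: Any, v) }.sortByKey()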

Cheers, Alex.

On Thu, Mar 3, 2016 at 8:39 AM, Angel Angel  wrote:

> Hello Sir/Madam,
>
> I am trying to sort the RDD using the *sortByKey* function, but I am
> getting the following error.
>
>
> My code is:
> 1) convert the RDD array into key-value pairs
> 2) after that, sort by key
>
> but I am getting the error *No implicit Ordering defined for Any*
>
>  [image: Inline image 1]
>
>
>
> thanks
>


Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir,

Regarding streaming, you can take a look at Spark Streaming, the
micro-batching framework. If it satisfies your needs, it comes with a bunch
of integrations, so the source for the jobs could be Kafka, Flume or Akka.
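
For example, a minimal sketch of a Kafka-backed job (the broker and topic
names are made up, and it needs the spark-streaming-kafka artifact on the
classpath):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-ingest")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.map(_._2).count().print()   // just count the messages in each batch

ssc.start()
ssc.awaitTermination()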

Cheers, Alex.

On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael <moshir.mik...@gmail.com>
wrote:

> Hi Alex,
> thanks for the link, I will check it.
> Does anyone know of a more streamlined approach?
>
>
>
>
> On Mon, Feb 29, 2016 at 10:28 AM, Alex Dzhagriev <dzh...@gmail.com> wrote:
>
>> Hi Moshir,
>>
>> I think you can use the rest api provided with Spark:
>> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
>>
>> Unfortunately, I haven't found any documentation, but it looks fine.
>> Thanks, Alex.
>>
>> On Sun, Feb 28, 2016 at 3:25 PM, mms <moshir.mik...@gmail.com> wrote:
>>
>>> Hi, I cannot find a simple example showing how a typical application can
>>> 'connect' to a remote Spark cluster and interact with it. Let's say I have
>>> a Python web application hosted somewhere *outside* a Spark cluster,
>>> with just Python installed on it. How can I talk to Spark without using a
>>> notebook, or using SSH to connect to a cluster master node? I know of
>>> spark-submit and spark-shell, however forking a process on a remote host to
>>> execute a shell script seems like a lot of effort. What are the recommended
>>> ways to connect to and query Spark from a remote client? Thanks!
>>>
>>
>>


Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir,

I think you can use the REST API provided with Spark:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
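
For what it's worth, a rough sketch of the kind of request that endpoint
(port 6066 by default) accepts; I pieced the field names together from the
linked source, so treat the paths, class and addresses as placeholders:

curl -X POST http://spark-master:6066/v1/submissions/create \
  --header "Content-Type: application/json" \
  --data '{
    "action": "CreateSubmissionRequest",
    "appResource": "hdfs:///jobs/my-job.jar",
    "mainClass": "com.example.MyJob",
    "appArgs": ["arg1"],
    "clientSparkVersion": "1.6.0",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
      "spark.app.name": "MyJob",
      "spark.master": "spark://spark-master:7077",
      "spark.submit.deployMode": "cluster",
      "spark.jars": "hdfs:///jobs/my-job.jar"
    }
  }'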

Unfortunately, I haven't found any documentation for it, but it looks fine.
Thanks, Alex.

On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:

> Hi, I cannot find a simple example showing how a typical application can
> 'connect' to a remote Spark cluster and interact with it. Let's say I have
> a Python web application hosted somewhere *outside* a Spark cluster, with
> just Python installed on it. How can I talk to Spark without using a
> notebook, or using SSH to connect to a cluster master node? I know of
> spark-submit and spark-shell, however forking a process on a remote host to
> execute a shell script seems like a lot of effort. What are the recommended
> ways to connect to and query Spark from a remote client? Thanks!
>


Re: reasonable number of executors

2016-02-24 Thread Alex Dzhagriev
Hi Igor,

That's a great talk and an exact answer to my question. Thank you.

Cheers, Alex.

On Tue, Feb 23, 2016 at 8:27 PM, Igor Berman <igor.ber...@gmail.com> wrote:

>
> http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
>
> there is a section in it related to your question
>
> On 23 February 2016 at 16:49, Alex Dzhagriev <dzh...@gmail.com> wrote:
>
>> Hello all,
>>
>> Can someone please advise me on the pros and cons of how to allocate the
>> resources: many machines with small heaps and one core each, or a few
>> machines with big heaps and many cores? I'm sure it depends on the data flow
>> and there is no best-practice solution; e.g., with a bigger heap I can
>> perform a map-side join with a bigger table. What other considerations
>> should I keep in mind in order to choose the right configuration?
>>
>> Thanks, Alex.
>>
>
>


reasonable number of executors

2016-02-23 Thread Alex Dzhagriev
Hello all,

Can someone please advise me on the pros and cons of how to allocate the
resources: many machines with small heaps and one core each, or a few
machines with big heaps and many cores? I'm sure it depends on the data flow
and there is no best-practice solution; e.g., with a bigger heap I can
perform a map-side join with a bigger table. What other considerations
should I keep in mind in order to choose the right configuration?
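
To make the question concrete, here are the two extremes I have in mind for
a hypothetical 16-core, 64 GB worker (the numbers are only illustrative, and
--num-executors assumes YARN):

# many small executors: one core and a small heap each
./bin/spark-submit --num-executors 16 --executor-cores 1 --executor-memory 3G ...

# a few large executors: several cores sharing one big heap
./bin/spark-submit --num-executors 3 --executor-cores 5 --executor-memory 19G ...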

Thanks, Alex.


Re: Can we load csv partitioned data into one DF?

2016-02-22 Thread Alex Dzhagriev
Hi Saif,

You can put all your files into one directory and read them as text. Another
option is to read them separately and then union the datasets.
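
A rough sketch of both options, assuming the spark-csv package is on the
classpath and using made-up paths (with plain sc.textFile the idea is the
same):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// option 1: point the reader at a glob covering all the yearly files
val all = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/data/csv/*.csv")

// option 2: read each file separately and union the results
// val all = paths.map(p => sqlContext.read.format("com.databricks.spark.csv")
//   .option("header", "true").load(p)).reduce(_ unionAll _)

// final target: write everything out as a single parquet dataset
all.write.parquet("/data/parquet/all_years")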

Thanks, Alex.

On Mon, Feb 22, 2016 at 4:25 PM,  wrote:

> Hello all, I am facing a silly data question.
>
> If I have 100+ csv files which are all part of the same data, but each csv
> is, for example, one year on a timeframe column (i.e. partitioned by year),
> what would you suggest instead of loading all those files and joining them?
>
> Final target would be parquet. Is it possible, for example, to load them
> and then store them as parquet, and then read parquet and consider all as
> one?
>
> Thanks for any suggestions,
> Saif
>
>


an OOM while persist as DISK_ONLY

2016-02-22 Thread Alex Dzhagriev
Hello all,

I'm using Spark 1.6 and trying to cache a dataset which is 1.5 TB. I have
only ~800 GB of RAM in total, so I am choosing the DISK_ONLY storage level.
Unfortunately, I'm exceeding the overhead memory limit:


Container killed by YARN for exceeding memory limits. 27.0 GB of 27 GB
physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.


I'm giving 6 GB of overhead memory and using 10 cores per executor.
Apparently, that's not enough. Without persisting the data, and instead
recomputing the dataset (twice in my case), the job works fine. Can anyone
please explain what overhead consumes that much memory while persisting to
disk, and how I can estimate how much extra memory I should give the
executors so that the job doesn't fail?
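
For reference, this is roughly how I'm running it at the moment (the
executor memory is inferred from the 27 GB container limit minus the 6 GB
overhead):

./bin/spark-submit \
  --executor-cores 10 \
  --executor-memory 21G \
  --conf spark.yarn.executor.memoryOverhead=6144 \
  ...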

Thanks, Alex.


Re: Importing csv files into Hive ORC target table

2016-02-18 Thread Alex Dzhagriev
Hi Mich,

Try using a regexp to parse your string instead of a plain split on ",".
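
For example, a sketch against the csv2 RDD and Invoice class from your
session below, splitting only on commas that are outside double quotes (it
won't handle escaped quotes, but should be enough for the money fields):

// split on a comma only if it is followed by an even number of quotes,
// i.e. the comma itself is not inside a quoted field
val csvSplit = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"

val ttt = csv2.map(_.split(csvSplit)).map(p => Invoice(p(0), p(1), p(2), p(3), p(4)))
ttt.first   // all five columns; note the quoted fields keep their surrounding quotes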

Thanks, Alex.

On Thu, Feb 18, 2016 at 6:35 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

>
>
> thanks,
>
>
>
> I have an issue here.
>
> define rdd to read the CSV file
>
> scala> var csv = sc.textFile("/data/stg/table2")
> csv: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[69] at textFile
> at <console>:27
>
> I then get rid of the header
>
> scala> val csv2 = csv.mapPartitionsWithIndex { (idx, iter) => if (idx ==
> 0) iter.drop(1) else iter }
> csv2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at
> mapPartitionsWithIndex at <console>:29
>
> This is what I have now
>
> scala> csv.first
>
> res79: String = Invoice Number,Payment date,Net,VAT,Total
>
> *scala> csv2.first*
>
> *res80: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"*
>
> Then I define a class based on the columns
>
> scala> case class Invoice(invoice: String, date: String, net: String, vat:
> String, total: String)
> defined class Invoice
>
> Next stage to map the data to their individual columns
>
> scala> val ttt = csv2.map(_.split(",")).map(p =>
> Invoice(p(0),p(1),p(2),p(3),p(4)))
> ttt: org.apache.spark.rdd.RDD[Invoice] = MapPartitionsRDD[74] at map at
> <console>:33
>
> the problem I have now is that one column is missing
>
> *scala> ttt.first*
> *res81: Invoice = Invoice(360,10/02/2014,"?2,500.00",?0.00)*
>
> it seems that I am missing the last column here!
>
> I suspect the cause of the problem is the "," used in "?2,500.00", which is
> a money column ("£2,500.00") in Excel.
>
> Any work around is appreciated.
>
> Thanks,
>
> Mich
>
>
>
>
>
> On 17/02/2016 22:58, Alex Dzhagriev wrote:
>
> Hi Mich,
>
> You can use data frames (
> http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
> to achieve that.
>
> val sqlContext = new HiveContext(sc)
>
> var rdd = sc.textFile("/data/stg/table2")
>
> //...
> //perform your business logic, cleanups, etc.
> //...
>
> sqlContext.createDataFrame(resultRdd).write.orc("..path..")
>
> Please, note that resultRdd should contain Products (e.g. case classes)
>
> Cheers, Alex.
>
>
>
> On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>
>> Hi,
>>
>> We put csv files that are zipped using bzip into a staging area on hdfs
>>
>> In Hive an external table is created as below:
>>
>> DROP TABLE IF EXISTS stg_t2;
>> CREATE EXTERNAL TABLE stg_t2 (
>>  INVOICENUMBER string
>> ,PAYMENTDATE string
>> ,NET string
>> ,VAT string
>> ,TOTAL string
>> )
>> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
>> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
>> STORED AS TEXTFILE
>> LOCATION '/data/stg/table2'
>> TBLPROPERTIES ("skip.header.line.count"="1")
>>
>> We have an ORC table in Hive created as below:
>>
>>
>>
>> DROP TABLE IF EXISTS t2;
>> CREATE TABLE t2 (
>>  INVOICENUMBER  INT
>> ,PAYMENTDATE timestamp
>> ,NET DECIMAL(20,2)
>> ,VAT DECIMAL(20,2)
>> ,TOTAL  DECIMAL(20,2)
>> )
>> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
>> STORED AS ORC
>> TBLPROPERTIES ( "orc.compress"="ZLIB" )
>> ;
>>
>> Then we insert the data from the external table into the target table,
>> doing some conversion and ignoring empty rows:
>>
>> INSERT INTO TABLE t2
>> SELECT
>>   INVOICENUMBER
>> , CAST(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy')*1000 as timestamp)
>> --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
>> , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>> , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
>> , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
>> FROM
>> stg_t2
>>
>> This works OK for now.
>>
>>
>>
>> I was wondering whether this could be done using operations on rdd in
>> Spark?
>>
>> var rdd = sc.textFile("/data/stg/table2")
>>
>> I can use rdd.count to see the total rows and
>> rdd.collect.foreach(println) to see the individual rows
>>
>>
>>
>> I would like to get some ideas on how I can do CAST conversion etc. on the
>> data to clean it up and store it in the said ORC table?

Re: Importing csv files into Hive ORC target table

2016-02-17 Thread Alex Dzhagriev
Hi Mich,

You can use data frames (
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes)
to achieve that.

val sqlContext = new HiveContext(sc)

var rdd = sc.textFile("/data/stg/table2")

//...
//perform your business logic, cleanups, etc.
//...

sqlContext.createDataFrame(resultRdd).write.orc("..path..")

Please note that resultRdd should contain Products (e.g. case classes).
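
A fuller sketch of the same idea, with made-up paths and only the cleanups
that are obvious from your description (dropping the header, stripping the
currency noise, parsing the date); adapt the parsing to your real data:

import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.hive.HiveContext

case class Invoice(invoiceNumber: Int, paymentDate: Timestamp,
                   net: BigDecimal, vat: BigDecimal, total: BigDecimal)

val sqlContext = new HiveContext(sc)
val raw = sc.textFile("/data/stg/table2")

// keep only digits and the decimal point, e.g. "£2,500.00" -> 2500.00
def money(s: String): BigDecimal = BigDecimal(s.replaceAll("[^\\d.]", ""))

val resultRdd = raw
  .mapPartitionsWithIndex((idx, it) => if (idx == 0) it.drop(1) else it)  // drop the header line
  .map(_.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))                      // split only outside quotes
  .map { p =>
    val fmt = new SimpleDateFormat("dd/MM/yyyy")
    Invoice(p(0).toInt, new Timestamp(fmt.parse(p(1)).getTime),
            money(p(2)), money(p(3)), money(p(4)))
  }

sqlContext.createDataFrame(resultRdd).write.orc("/data/orc/t2")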

Cheers, Alex.



On Wed, Feb 17, 2016 at 11:43 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
> We put csv files that are zipped using bzip into a staging area on hdfs
>
> In Hive an external table is created as below:
>
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
>  INVOICENUMBER string
> ,PAYMENTDATE string
> ,NET string
> ,VAT string
> ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/table2'
> TBLPROPERTIES ("skip.header.line.count"="1")
>
> We have an ORC table in Hive created as below:
>
>
>
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
>  INVOICENUMBER  INT
> ,PAYMENTDATE timestamp
> ,NET DECIMAL(20,2)
> ,VAT DECIMAL(20,2)
> ,TOTAL  DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet PayInsPeridaleTechnology'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> ;
>
> Then we insert the data from the external table into the target table,
> doing some conversion and ignoring empty rows:
>
> INSERT INTO TABLE t2
> SELECT
>   INVOICENUMBER
> , CAST(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy')*1000 as timestamp)
> --, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
> , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
> FROM
> stg_t2
>
> This works OK for now.
>
>
>
> I was wondering whether this could be done using operations on rdd in
> Spark?
>
> var rdd = sc.textFile("/data/stg/table2")
>
> I can use rdd.count to see the total rows and rdd.collect.foreach(println)
> to see the individual rows
>
>
>
> I would like to get some ideas on how I can do CAST conversion etc on the
> data to clean it up and store it in the said ORC table?
>
>
>
> Thanks
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
>
>
>


cartesian with Dataset

2016-02-17 Thread Alex Dzhagriev
Hello all,

Is anybody aware of any plans to support a cartesian product for Datasets?
Is there any way to work around this without switching to RDDs?
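
One workaround that comes to mind is a join with an always-true condition; a
sketch with toy data below, though I haven't checked how it is planned
compared to RDD.cartesian:

import sqlContext.implicits._
import org.apache.spark.sql.functions.lit

val ds1 = Seq(1, 2, 3).toDS()
val ds2 = Seq("a", "b").toDS()

// every (Int, String) pair, effectively the cartesian product
val cross = ds1.joinWith(ds2, lit(true))
cross.show()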

Thanks, Alex.