Re: Get filename in Spark Streaming

2015-02-06 Thread Subacini B
Thank you Emre, this helps; I am able to get the filename.

But I am not sure how to fit this into the DStream RDDs.

val inputStream = ssc.textFileStream("/hdfs Path/")

inputStream is a DStream of RDDs, and inside foreachRDD I am doing my processing:

 inputStream.foreachRDD(rdd => {
   // how to get the filename here??
})


Can you please help.
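
The links in Emre's reply below show the batch-style way to make the file name visible: read the directory through the new Hadoop API so each partition can see its input split. A minimal sketch, assuming Spark 1.1+, a SparkContext sc (ssc.sparkContext), and the ABC_1421893256000.txt naming from the original question; hadoopRdd and linesWithTimestamp are illustrative names:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.NewHadoopRDD

    // Read through the new Hadoop API so the input split (and its file path) is accessible.
    val hadoopRdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("/hdfs Path/")
      .asInstanceOf[NewHadoopRDD[LongWritable, Text]]

    // Tag each line with the timestamp parsed from its file name, e.g. ABC_1421893256000.txt.
    val linesWithTimestamp = hadoopRdd.mapPartitionsWithInputSplit { (split, iter) =>
      val fileName = split.asInstanceOf[FileSplit].getPath.getName
      val timestamp = fileName.split("[_.]")(1).toLong   // 1421893256000
      iter.map { case (_, line) => (timestamp, line.toString) }
    }

This does not plug straight into textFileStream, because the RDD handed to foreachRDD has already dropped the split information; it is only a sketch of the technique from the first link, which would have to be adapted per batch on the streaming side.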


On Thu, Feb 5, 2015 at 11:15 PM, Emre Sevinc emre.sev...@gmail.com wrote:

 Hello,

 Did you check the following?


 http://themodernlife.github.io/scala/spark/hadoop/hdfs/2014/09/28/spark-input-filename/

 http://apache-spark-user-list.1001560.n3.nabble.com/access-hdfs-file-name-in-map-td6551.html

 --
 Emre Sevinç


 On Fri, Feb 6, 2015 at 2:16 AM, Subacini B subac...@gmail.com wrote:

 Hi All,

 We have filenames with a timestamp, say ABC_1421893256000.txt, and the
 timestamp needs to be extracted from the file name for further processing. Is
 there a way to get the input file name picked up by the Spark Streaming job?

 Thanks in advance

 Subacini




 --
 Emre Sevinc



Get filename in Spark Streaming

2015-02-05 Thread Subacini B
Hi All,

We have filenames with a timestamp, say ABC_1421893256000.txt, and the
timestamp needs to be extracted from the file name for further processing. Is
there a way to get the input file name picked up by the Spark Streaming job?

Thanks in advance

Subacini


Improve performance using spark streaming + sparksql

2015-01-24 Thread Subacini B
Hi All,

I have a cluster of 3 nodes [each with 8 cores / 32 GB memory].

My program uses Spark Streaming with Spark SQL [Spark 1.1] and writes
incoming JSON to Elasticsearch and HBase. Below is my code; I receive
JSON files [input data varies from 30 MB to 300 MB] every 10 seconds.
Whether I use 3 nodes or 1 node, the processing time is pretty close: say, 15
files take 5 minutes to process end to end.

I have set spark.streaming.unpersist to true, stream repartition to 3,
Kryo serialization, UseCompressedOops, and some of the other GC and
performance tuning mentioned in the Spark docs. Still there is not much
improvement. Is there any other way that I can tune my code to improve
performance? Appreciate your help. Thanks in advance.

 inputStream.foreachRDD(rdd => {

   // Get SchemaRDD
   val inputJson = sqlContext.jsonRDD(rdd)

   // Store it in a Spark SQL table
   inputJson.registerTempTable("tableA")

   // Parse JSON
   val outputJson = sqlContext.sql("select colA, colB etc. from tableA")

   // Write to ES, HBase [uses foreachPartition]
   ESUtil.writeToES(outputJson, coreName)
   HBaseUtils.saveAsHBaseTable(outputJson, hbasetable)
 })
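
One possible cost in the snippet above is that sqlContext.jsonRDD(rdd) runs a schema-inference pass over every 10-second batch before the query itself. If the JSON layout is fixed, a hedged variation (assuming Spark 1.1's jsonRDD(rdd, schema) overload; the column names and StringType are placeholders for the real fields) is to declare the schema once outside foreachRDD:

    import org.apache.spark.sql._

    // Declared once on the driver; adjust the fields/types to the real JSON.
    val schema = StructType(Seq(
      StructField("colA", StringType, nullable = true),
      StructField("colB", StringType, nullable = true)))

    inputStream.foreachRDD(rdd => {
      // Passing the schema skips the per-batch inference scan over the raw JSON.
      val inputJson = sqlContext.jsonRDD(rdd, schema)
      inputJson.registerTempTable("tableA")
      val outputJson = sqlContext.sql("select colA, colB from tableA")
      ESUtil.writeToES(outputJson, coreName)
      HBaseUtils.saveAsHBaseTable(outputJson, hbasetable)
    })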


Re: SchemaRDD to Hbase

2014-12-20 Thread Subacini B
Hi ,

Can someone help me? Any pointers would help.

Thanks
Subacini

On Fri, Dec 19, 2014 at 10:47 PM, Subacini B subac...@gmail.com wrote:

 Hi All,

 Is there any API that can be used directly to write a SchemaRDD to HBase?
 If not, what is the best way to write a SchemaRDD to HBase?

 Thanks
 Subacini
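
As far as I know there is no built-in SchemaRDD-to-HBase writer in this Spark version; the usual routes are the HBase client inside foreachPartition, or TableOutputFormat with saveAsNewAPIHadoopDataset. A minimal sketch of the first route; schemaRdd, the table name, the "cf" column family and the first-column-as-row-key convention are all placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    schemaRdd.foreachPartition { rows =>
      // One table handle per partition; expects hbase-site.xml on the executor classpath.
      val table = new HTable(HBaseConfiguration.create(), "hbasetable")
      rows.foreach { row =>
        val put = new Put(Bytes.toBytes(row.getString(0)))   // row key taken from the first column
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("colB"), Bytes.toBytes(row.getString(1)))
        table.put(put)
      }
      table.close()
    }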



SchemaRDD to Hbase

2014-12-19 Thread Subacini B
Hi All,

Is there any API that can be used directly to write a SchemaRDD to HBase?
If not, what is the best way to write a SchemaRDD to HBase?

Thanks
Subacini


Processing multiple requests in cluster

2014-09-24 Thread Subacini B
Hi All,

How can I run multiple requests concurrently on the same cluster?

I have a program using a Spark streaming context which reads streaming
data and writes it to HBase. It works fine; the problem is that when multiple
requests are submitted to the cluster, only the first request is processed, as the
entire cluster is used for that request. The rest of the requests are in
waiting mode.

I have set spark.cores.max to 2 or less so that the cluster can process another
request, but if there is only one request the cluster is not utilized properly.

Is there any way that the Spark cluster can process streaming requests
concurrently, effectively utilizing the cluster, something
like the Shark server?

Thanks
Subacini
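
In standalone mode the main lever for this is the per-application cap already mentioned: each submitted application keeps spark.cores.max cores and the rest stay free for the next one. A sketch of setting it in the driver (the app name and the cap value are illustrative); true concurrent handling of many requests inside one set of resources, the way the Shark server did, would instead mean running them as jobs inside a single shared context:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Cap this application so a second submission can still get cores.
    val conf = new SparkConf()
      .setAppName("streaming-to-hbase")       // illustrative name
      .set("spark.cores.max", "10")           // static per-application limit
    val ssc = new StreamingContext(conf, Seconds(10))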


Re: Spark SQL - groupby

2014-07-03 Thread Subacini B
Hi,

Can someone provide me with pointers for this issue?

Thanks
Subacini


On Wed, Jul 2, 2014 at 3:34 PM, Subacini B subac...@gmail.com wrote:

 Hi,

 The code below throws a compilation error, "not found: value Sum". Can
 someone help me with this? Do I need to add any jars or imports? Even for
 Count, the same error is thrown.

 val queryResult = sql("select * from Table")
 queryResult.groupBy('colA)('colA, Sum('colB) as 'totB)
   .aggregate(Sum('totB)).collect().foreach(println)

 Thanks
 subacini
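
For the "not found: value Sum" error quoted above, one thing to try: Sum and Count are Catalyst expression classes and are not pulled in by the usual SQLContext import, so they need an explicit import. A sketch under that assumption, keeping the original DSL query (it also assumes the symbol implicits for 'colA etc. are already in scope, as they must be for the rest of the snippet to compile):

    import org.apache.spark.sql.catalyst.expressions.{Count, Sum}

    val queryResult = sql("select * from Table")
    queryResult
      .groupBy('colA)('colA, Sum('colB) as 'totB)
      .aggregate(Sum('totB))
      .collect()
      .foreach(println)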



Spark SQL - groupby

2014-07-02 Thread Subacini B
Hi,

The code below throws a compilation error, "not found: value Sum". Can
someone help me with this? Do I need to add any jars or imports? Even for
Count, the same error is thrown.

val queryResult = sql("select * from Table")
queryResult.groupBy('colA)('colA, Sum('colB) as 'totB)
  .aggregate(Sum('totB)).collect().foreach(println)

Thanks
subacini


Shark Vs Spark SQL

2014-07-02 Thread Subacini B
Hi,

http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3cb75376b8-7a57-4161-b604-f919886cf...@gmail.com%3E

This talks about the Shark backend being replaced with the Spark SQL engine in
the future.
Does that mean Spark will continue to support Shark + Spark SQL for the long
term, or will Shark be decommissioned after some period?

Thanks
Subacini


Re: Spark Worker Core Allocation

2014-06-08 Thread Subacini B
Hi,

I am stuck here; my cluster is not efficiently utilized. Appreciate any
input on this.

Thanks
Subacini


On Sat, Jun 7, 2014 at 10:54 PM, Subacini B subac...@gmail.com wrote:

 Hi All,

 My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is in
 standalone mode (not using Mesos or YARN). I want two programs to run at the
 same time, so I have configured spark.cores.max=3, but when I run the
 program it allocates three cores by taking one core from each worker, making 3
 workers run the program.

 How can I configure it so that it takes 3 cores from 1 worker, so that I can
 use the other workers for the second program?

 Thanks in advance
 Subacini



Re: Spark Worker Core Allocation

2014-06-08 Thread Subacini B
Thanks Sean, let me try setting spark.deploy.spreadOut to false.


On Sun, Jun 8, 2014 at 12:44 PM, Sean Owen so...@cloudera.com wrote:

 Have a look at:

 https://spark.apache.org/docs/1.0.0/job-scheduling.html
 https://spark.apache.org/docs/1.0.0/spark-standalone.html

 The default is to grab resources on all nodes. In your case you could set
 spark.cores.max to 2 or less to enable running two apps simultaneously on a
 cluster of 4-core machines.

 See also spark.deploy.defaultCores

 But you may really be after spark.deploy.spreadOut. If you make it false,
 it will instead try to take all resources from a few nodes.
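
For reference, spark.deploy.spreadOut and spark.deploy.defaultCores are read by the standalone master, not by the application, so per the standalone docs linked above they are set in the master's environment (a sketch; restart the master afterwards), while spark.cores.max stays in each application's own configuration:

    # conf/spark-env.sh on the master host
    SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
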
  On Jun 8, 2014 1:55 AM, Subacini B subac...@gmail.com wrote:

 Hi All,

 My cluster has 5 workers, each having 4 cores (so 20 cores in total). It is
 in standalone mode (not using Mesos or YARN). I want two programs to run at
 the same time, so I have configured spark.cores.max=3, but when I run the
 program it allocates three cores by taking one core from each worker, making 3
 workers run the program.

 How can I configure it so that it takes 3 cores from 1 worker, so that I can
 use the other workers for the second program?

 Thanks in advance
 Subacini