Re: Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-09 Thread Ajay Chander
Hi Everyone,

I am still trying to figure this one out. I am stuck with this error:
"java.io.IOException: Can't get Master Kerberos principal for use as renewer". Below is my code.
Can any of you please provide any insights on this? Thanks for your time.


import java.io.{BufferedInputStream, File, FileInputStream}
import java.net.URI

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.{SparkConf, SparkContext}


object SparkHdfs {

  def main(args: Array[String]): Unit = {

    System.setProperty("java.security.krb5.conf",
      new File("src\\main\\files\\krb5.conf").getAbsolutePath)
    System.setProperty("sun.security.krb5.debug", "true")

    val sparkConf = new SparkConf().setAppName("SparkHdfs").setMaster("local")
    val sc = new SparkContext(sparkConf)

    // Loading remote cluster configurations
    sc.hadoopConfiguration.addResource(new File("src\\main\\files\\core-site.xml").getAbsolutePath)
    sc.hadoopConfiguration.addResource(new File("src\\main\\files\\hdfs-site.xml").getAbsolutePath)
    sc.hadoopConfiguration.addResource(new File("src\\main\\files\\mapred-site.xml").getAbsolutePath)
    sc.hadoopConfiguration.addResource(new File("src\\main\\files\\yarn-site.xml").getAbsolutePath)
    sc.hadoopConfiguration.addResource(new File("src\\main\\files\\ssl-client.xml").getAbsolutePath)
    sc.hadoopConfiguration.addResource(new File("src\\main\\files\\topology.map").getAbsolutePath)

    val conf = new Configuration()

    // Loading remote cluster configurations
    conf.addResource(new Path(new File("src\\main\\files\\core-site.xml").getAbsolutePath))
    conf.addResource(new Path(new File("src\\main\\files\\hdfs-site.xml").getAbsolutePath))
    conf.addResource(new Path(new File("src\\main\\files\\mapred-site.xml").getAbsolutePath))
    conf.addResource(new Path(new File("src\\main\\files\\yarn-site.xml").getAbsolutePath))
    conf.addResource(new Path(new File("src\\main\\files\\ssl-client.xml").getAbsolutePath))
    conf.addResource(new Path(new File("src\\main\\files\\topology.map").getAbsolutePath))

    conf.set("hadoop.security.authentication", "Kerberos")

    UserGroupInformation.setConfiguration(conf)

    UserGroupInformation.loginUserFromKeytab("my...@internal.company.com",
      new File("src\\main\\files\\myusr.keytab").getAbsolutePath)

    // SparkHadoopUtil.get.loginUserFromKeytab("tsad...@internal.imsglobal.com",
    //   new File("src\\main\\files\\tsadusr.keytab").getAbsolutePath)
    // Getting this error: java.io.IOException: Can't get Master Kerberos principal for use as renewer

    sc.textFile("hdfs://vm1.comp.com:8020/user/myusr/temp/file1").collect().foreach(println)
    // Getting this error: java.io.IOException: Can't get Master Kerberos principal for use as renewer

  }
}
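
This IOException is typically raised by Hadoop's token code when the configuration handed to the job carries no master principal (for example yarn.resourcemanager.principal), which usually means the remote cluster's yarn-site.xml/hdfs-site.xml never actually made it into the configuration in effect. Note also that Hadoop's Configuration.addResource(String) looks the name up on the classpath, while addResource(Path) reads a file path. Below is a minimal sketch of that idea rather than a verified fix; the principals, realm and file locations are placeholders that have to match the remote cluster.

import java.io.File

import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object SparkHdfsKerberosSketch {
  def main(args: Array[String]): Unit = {
    System.setProperty("java.security.krb5.conf", new File("src\\main\\files\\krb5.conf").getAbsolutePath)

    val sc = new SparkContext(new SparkConf().setAppName("SparkHdfsKerberosSketch").setMaster("local"))

    // Load the remote cluster's site files as Path resources
    // (addResource(String) would be treated as a classpath resource name instead).
    Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml", "mapred-site.xml").foreach { f =>
      sc.hadoopConfiguration.addResource(new Path(new File("src\\main\\files\\" + f).getAbsolutePath))
    }

    // If the site files still do not carry them, set the principals the token code needs
    // explicitly (placeholder principals shown here).
    sc.hadoopConfiguration.set("hadoop.security.authentication", "kerberos")
    sc.hadoopConfiguration.set("yarn.resourcemanager.principal", "yarn/_HOST@INTERNAL.COMPANY.COM")
    sc.hadoopConfiguration.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@INTERNAL.COMPANY.COM")

    UserGroupInformation.setConfiguration(sc.hadoopConfiguration)
    UserGroupInformation.loginUserFromKeytab(
      "my...@internal.company.com",
      new File("src\\main\\files\\myusr.keytab").getAbsolutePath)

    sc.textFile("hdfs://vm1.comp.com:8020/user/myusr/temp/file1").collect().foreach(println)
  }
}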




On Mon, Nov 7, 2016 at 9:42 PM, Ajay Chander <itsche...@gmail.com> wrote:

> Did anyone use https://www.codatlas.com/github.com/apache/spark/HEAD/
> core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala to
> interact with secured Hadoop from Spark ?
>
> Thanks,
> Ajay
>
> On Mon, Nov 7, 2016 at 4:37 PM, Ajay Chander <itsche...@gmail.com> wrote:
>
>>
>> Hi Everyone,
>>
>> I am trying to develop a simple codebase on my machine to read data from a
>> secured Hadoop cluster. We have a development cluster which is secured
>> through Kerberos, and I want to run a Spark job from my IntelliJ to read
>> some sample data from the cluster. Has anyone done this before? Can you
>> point me to some sample examples?
>>
>> I understand that, if we want to talk to a secured cluster, we need to have
>> a keytab and principal. I tried using it through
>> UserGroupInformation.loginUserFromKeytab and
>> SparkHadoopUtil.get.loginUserFromKeytab but so far no luck.
>>
>> I have been trying to do this for quite a while now. Please let me know
>> if you need more info. Thanks
>>
>> Regards,
>> Ajay
>>
>
>


Re: Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-07 Thread Ajay Chander
Did anyone use
https://www.codatlas.com/github.com/apache/spark/HEAD/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
to interact with secured Hadoop from Spark ?

Thanks,
Ajay

On Mon, Nov 7, 2016 at 4:37 PM, Ajay Chander <itsche...@gmail.com> wrote:

>
> Hi Everyone,
>
> I am trying to develop a simple codebase on my machine to read data from a
> secured Hadoop cluster. We have a development cluster which is secured
> through Kerberos, and I want to run a Spark job from my IntelliJ to read
> some sample data from the cluster. Has anyone done this before? Can you
> point me to some sample examples?
>
> I understand that, if we want to talk to a secured cluster, we need to have
> a keytab and principal. I tried using it through 
> UserGroupInformation.loginUserFromKeytab
> and SparkHadoopUtil.get.loginUserFromKeytab but so far no luck.
>
> I have been trying to do this for quite a while now. Please let me know
> if you need more info. Thanks
>
> Regards,
> Ajay
>


Access_Remote_Kerberized_Cluster_Through_Spark

2016-11-07 Thread Ajay Chander
Hi Everyone,

I am trying to develop a simple codebase on my machine to read data from a
secured Hadoop cluster. We have a development cluster which is secured
through Kerberos, and I want to run a Spark job from my IntelliJ to read
some sample data from the cluster. Has anyone done this before? Can you
point me to some sample examples?

I understand that, if we want to talk to a secured cluster, we need to have
a keytab and principal. I tried using it through
UserGroupInformation.loginUserFromKeytab
and SparkHadoopUtil.get.loginUserFromKeytab but so far no luck.

I have been trying to do this for quite a while now. Please let me know if
you need more info. Thanks

Regards,
Ajay


Re: HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Sean, thank you for making it clear. It was helpful.

Regards,
Ajay

On Wednesday, October 26, 2016, Sean Owen <so...@cloudera.com> wrote:

> This usage is fine, because you are only using the HiveContext locally on
> the driver. It's applied in a function that's used on a Scala collection.
>
> You can't use the HiveContext or SparkContext in a distributed operation.
> It has nothing to do with for loops.
>
> The fact that they're serializable is misleading. It's there, I believe,
> because these objects may be inadvertently referenced in the closure of a
> function that executes remotely, yet doesn't use the context. The closure
> cleaner can't always remove this reference. The task would fail to
> serialize even though it doesn't use the context. You will find these
> objects serialize but then don't work if used remotely.
>
> The NPE you see is an unrelated cosmetic problem that was fixed in 2.0.1
> IIRC.
>
> On Wed, Oct 26, 2016 at 4:28 AM Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I was thinking if I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>> val conf = new SparkConf()
>> val sc = new SparkContext(conf)
>> val hiveContext = new HiveContext(sc)
>>
>> val dataElementsFile = args(0)
>> val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>> def calculate(de: Row) {
>>   val dataElement = de.getAs[String]("DataElement").trim
>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>> }
>>
>> deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>
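
A minimal sketch of the distinction Sean describes (the database and table names below are placeholders, not anything from this thread): the first loop runs on the driver over a plain Scala collection, so using the HiveContext there works; the second runs inside an RDD operation on the executors, where the context cannot be used.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DriverVsExecutorContext {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DriverVsExecutorContext"))
    val hiveContext = new HiveContext(sc)

    // Driver side: a plain Scala collection foreach runs in the driver JVM,
    // so calling hiveContext here is fine.
    Seq("2016-01-01", "2016-02-01").foreach { dt =>
      hiveContext.sql(s"SELECT COUNT(*) FROM some_db.some_table WHERE cyc_dt = '$dt'").show()
    }

    // Executor side: the closure of an RDD operation is serialized and run on executors;
    // referencing hiveContext (or sc) inside it will fail at runtime.
    // sc.parallelize(Seq("2016-01-01", "2016-02-01")).foreach { dt =>
    //   hiveContext.sql("...")   // don't do this
    // }
  }
}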


Re: HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Sunita, Thanks for your time. In my scenario, based on each attribute from
deDF(1 column with just 66 rows), I have to query a Hive table and insert
into another table.

Thanks,
Ajay

On Wed, Oct 26, 2016 at 12:21 AM, Sunita Arvind <sunitarv...@gmail.com>
wrote:

> Ajay,
>
> Afaik, generally these contexts cannot be accessed within loops. The sql
> query itself would run on distributed datasets, so it's a parallel
> execution. Putting them in foreach would make it nested within nested, so
> serialization would become hard. Not sure I could explain it right.
>
> If you can create the dataframe in main, you can register it as a table
> and run the queries in main method itself. You don't need to coalesce or
> run the method within foreach.
>
> Regards
> Sunita
>
> On Tuesday, October 25, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>
>>
>> Jeff, Thanks for your response. I see below error in the logs. You think
>> it has to do anything with hiveContext ? Do I have to serialize it before
>> using inside foreach ?
>>
>> 16/10/19 15:16:23 ERROR scheduler.LiveListenerBus: Listener SQLListener threw an exception
>> java.lang.NullPointerException
>> at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
>> at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>> at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>> at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>> at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>> at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
>> at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
>>
>> Thanks,
>> Ajay
>>
>> On Tue, Oct 25, 2016 at 11:45 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>>
>>> In your sample code, you can use hiveContext in the foreach as it is a
>>> Scala List foreach operation which runs on the driver side. But you cannot
>>> use hiveContext in RDD.foreach
>>>
>>>
>>>
>>> Ajay Chander <itsche...@gmail.com> wrote on Wed, Oct 26, 2016 at 11:28 AM:
>>>
>>>> Hi Everyone,
>>>>
>>>> I was thinking if I can use hiveContext inside foreach like below,
>>>>
>>>> object Test {
>>>>   def main(args: Array[String]): Unit = {
>>>>
>>>> val conf = new SparkConf()
>>>> val sc = new SparkContext(conf)
>>>> val hiveContext = new HiveContext(sc)
>>>>
>>>> val dataElementsFile = args(0)
>>>> val deDF = 
>>>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>>>
>>>> def calculate(de: Row) {
>>>>   val dataElement = de.getAs[String]("DataElement").trim
>>>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>>>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>>>> TEST_DB.TEST_TABLE1 ")
>>>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>>>> }
>>>>
>>>> deDF.collect().foreach(calculate)
>>>>   }
>>>> }
>>>>
>>>>
>>>> I looked at 
>>>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>>>  and I see it is extending SqlContext which extends Logging with 
>>>> Serializable.
>>>>
>>>> Can anyone tell me if this is the right way to use it ? Thanks for your 
>>>> time.
>>>>
>>>> Regards,
>>>>
>>>> Ajay
>>>>
>>>>
>>


Re: HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Jeff, Thanks for your response. I see below error in the logs. You think it
has to do anything with hiveContext ? Do I have to serialize it before
using inside foreach ?

16/10/19 15:16:23 ERROR scheduler.LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

Thanks,
Ajay

On Tue, Oct 25, 2016 at 11:45 PM, Jeff Zhang <zjf...@gmail.com> wrote:

>
> In your sample code, you can use hiveContext in the foreach as it is a Scala
> List foreach operation which runs on the driver side. But you cannot use
> hiveContext in RDD.foreach
>
>
>
> Ajay Chander <itsche...@gmail.com> wrote on Wed, Oct 26, 2016 at 11:28 AM:
>
>> Hi Everyone,
>>
>> I was thinking if I can use hiveContext inside foreach like below,
>>
>> object Test {
>>   def main(args: Array[String]): Unit = {
>>
>> val conf = new SparkConf()
>> val sc = new SparkContext(conf)
>> val hiveContext = new HiveContext(sc)
>>
>> val dataElementsFile = args(0)
>> val deDF = 
>> hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()
>>
>> def calculate(de: Row) {
>>   val dataElement = de.getAs[String]("DataElement").trim
>>   val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + 
>> dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM 
>> TEST_DB.TEST_TABLE1 ")
>>   df1.write.insertInto("TEST_DB.TEST_TABLE1")
>> }
>>
>> deDF.collect().foreach(calculate)
>>   }
>> }
>>
>>
>> I looked at 
>> https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>  and I see it is extending SqlContext which extends Logging with 
>> Serializable.
>>
>> Can anyone tell me if this is the right way to use it ? Thanks for your time.
>>
>> Regards,
>>
>> Ajay
>>
>>


HiveContext is Serialized?

2016-10-25 Thread Ajay Chander
Hi Everyone,

I was wondering if I can use hiveContext inside foreach like below,

object Test {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    val dataElementsFile = args(0)
    val deDF = hiveContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()

    def calculate(de: Row) {
      val dataElement = de.getAs[String]("DataElement").trim
      val df1 = hiveContext.sql("SELECT cyc_dt, supplier_proc_i, '" + dataElement +
        "' as data_elm, " + dataElement + " as data_elm_val FROM TEST_DB.TEST_TABLE1 ")
      df1.write.insertInto("TEST_DB.TEST_TABLE1")
    }

    deDF.collect().foreach(calculate)
  }
}


I looked at
https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
and I see it extends SQLContext, which extends Logging with Serializable.

Can anyone tell me if this is the right way to use it? Thanks for your time.

Regards,

Ajay


Re: Code review / sqlContext Scope

2016-10-19 Thread Ajay Chander
Can someone please shed some light on this? I wrote the below code in
Scala 2.10.5; can someone please tell me if this is the right way of doing
it?


import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext

class Test {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)

    sqlContext.sql("set spark.sql.shuffle.partitions=1000")
    sqlContext.sql("set hive.exec.dynamic.partition.mode=nonstrict")

    val dataElementsFile = "hdfs://nameservice/user/ajay/spark/flds.txt"

    // deDF has only 61 rows
    val deDF = sqlContext.read.text(dataElementsFile).toDF("DataElement").coalesce(1).distinct().cache()

    deDF.withColumn("ds_nm", lit("UDA"))
      .withColumn("tabl_nm", lit("TEST_DB.TEST_TABLE"))
      .collect()
      .filter(filterByDataset)
      .map(calculateMetricsAtDELevel)
      .foreach(persistResults)

    // if ds_nm starts with 'RAW_' I don't want to process it
    def filterByDataset(de: Row): Boolean = {
      val datasetName = de.getAs[String]("ds_nm").trim
      if (datasetName.startsWith("RAW_")) {
        return false
      } else {
        return true
      }
    }

    def calculateMetricsAtDELevel(de: Row): DataFrame = {
      val dataElement = de.getAs[String]("DataElement").trim
      val datasetName = de.getAs[String]("ds_nm").trim
      val tableName = de.getAs[String]("tabl_nm").trim

      // udaDF holds 107,762,849 rows * 412 columns / 105 files in HDFS and 176.5 GB * 3 replication factor
      val udaDF = sqlContext.sql("SELECT '" + datasetName + "' as ds_nm, cyc_dt, supplier_proc_i, " +
        " '" + dataElement + "' as data_elm, " + dataElement + " as data_elm_val FROM " + tableName + "")
      println("udaDF num Partitions: " + udaDF.toJavaRDD.getNumPartitions)
      // udaDF.toJavaRDD.getNumPartitions = 1490

      val calculatedMetrics = udaDF.groupBy("ds_nm", "cyc_dt", "supplier_proc_i", "data_elm", "data_elm_val").count()
      println("calculatedMetrics num Partitions: " + calculatedMetrics.toJavaRDD.getNumPartitions)
      // calculatedMetrics.toJavaRDD.getNumPartitions = 1000 since I set spark.sql.shuffle.partitions=1000

      val adjustedSchemaDF = calculatedMetrics.withColumnRenamed("count", "derx_val_cnt")
        .withColumn("load_dt", current_timestamp())
      println("adjustedSchemaDF num Partitions: " + adjustedSchemaDF.toJavaRDD.getNumPartitions)
      // adjustedSchemaDF.toJavaRDD.getNumPartitions = 1000 as well

      return adjustedSchemaDF
    }

    def persistResults(adjustedSchemaDF: DataFrame) = {
      // persist the results into a Hive table backed by PARQUET
      adjustedSchemaDF.write.partitionBy("ds_nm", "cyc_dt").mode("Append")
        .insertInto("devl_df2_spf_batch.spf_supplier_trans_metric_detl_base_1")
    }

  }
}

This is my cluster (Spark 1.6.0 on YARN, Cloudera 5.7.1) configuration:

Memory -> 4.10 TB

VCores -> 544

I am deploying the application in yarn client mode and the cluster is
set to use Dynamic Memory Allocation.

Any pointers are appreciated.

Thank you


On Sat, Oct 8, 2016 at 1:17 PM, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Everyone,
>
> Can anyone tell me if there is anything wrong with my code flow below ?
> Based on each element from the text file I would like to run a query
> against Hive table and persist results in another Hive table. I want to do
> this in parallel for each element in the file. I appreciate any of your
> inputs on this.
>
> $ cat /home/ajay/flds.txt
> PHARMY_NPI_ID
> ALT_SUPPLIER_STORE_NBR
> MAIL_SERV_NBR
>
> spark-shell  --name hivePersistTest --master yarn --deploy-mode client
>
> val dataElementsFile = "/home/ajay/flds.txt"
> val dataElements = Source.fromFile(dataElementsFile).getLines.toArray
>
> def calculateQuery (de: String)  : DataFrame = {
>   val calculatedQuery = sqlContext.sql("select 'UDA' as ds_nm, cyc_dt, 
> supplier_proc_i as supplier_proc_id, '" + de + "' as data_elm, " + de + " as 
> data_elm_val," +
> " count(1) as derx_val_cnt, current_timestamp as load_dt " +
> "from SPRINT2_TEST2 " +
> "group by 'UDA', cyc_dt, supplier_proc_i, '" + de + "' , " + de + " ")
>
>   return calculatedQuery
> }
>
> def persistResults (calculatedQuery: DataFrame) = {
>   calculatedQuery.write.insertInto("sprint2_stp1_test2")
> }
>
> dataElements.map(calculateQuery).foreach(persistResults)
>
>
> Thanks.
>
>
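
On the "in parallel for each element" part of the question above, one option the thread does not spell out is to drive the per-element jobs from parallel driver threads, since jobs submitted concurrently against the same SparkContext can run side by side. A rough spark-shell sketch along the lines of the quoted code, not a tested recipe; the pool size of 4 is an arbitrary placeholder to be tuned to what the cluster can absorb.

// Sketch only: submits one Spark job per data element from parallel driver threads.
import scala.collection.parallel.ForkJoinTaskSupport
import scala.io.Source

val dataElements = Source.fromFile("/home/ajay/flds.txt").getLines.toArray.par
dataElements.tasksupport =
  new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(4))  // cap concurrent jobs

dataElements.foreach { de =>
  val df = sqlContext.sql(
    "select 'UDA' as ds_nm, cyc_dt, supplier_proc_i as supplier_proc_id, '" + de + "' as data_elm, " +
    de + " as data_elm_val, count(1) as derx_val_cnt, current_timestamp as load_dt " +
    "from SPRINT2_TEST2 group by 'UDA', cyc_dt, supplier_proc_i, '" + de + "' , " + de)
  df.write.insertInto("sprint2_stp1_test2")
}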


Re: Spark_JDBC_Partitions

2016-09-19 Thread Ajay Chander
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 10 September 2016 at 22:37, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> In Oracle something called rownum is present in every row. You can
>>>> create an even distribution using that column. If it is one-time work,
>>>> try using Sqoop. Are you using Oracle's own appliance? Then you can use
>>>> the data pump format
>>>> On 11 Sep 2016 01:59, "Mich Talebzadeh" <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> creating an Oracle sequence for a table of 200million is not going to
>>>>> be that easy without changing the schema. It is possible to export that
>>>>> table from prod and import it to DEV/TEST and create the sequence there.
>>>>>
>>>>> If it is a FACT table then the foreign keys from the Dimension tables
>>>>> will be bitmap indexes on the FACT table so they can be potentially used.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>> On 10 September 2016 at 16:42, Takeshi Yamamuro <linguin@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Yea, spark does not have the same functionality with sqoop.
>>>>>> I think one of simple solutions is to assign unique ids on the oracle
>>>>>> table by yourself.
>>>>>> Thought?
>>>>>>
>>>>>> // maropu
>>>>>>
>>>>>>
>>>>>> On Sun, Sep 11, 2016 at 12:37 AM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Strange that Oracle table of 200Million plus rows has not been
>>>>>>> partitioned.
>>>>>>>
>>>>>>> What matters here is to have parallel connections from JDBC to
>>>>>>> Oracle, each reading a sub-set of table. Any parallel fetch is going to 
>>>>>>> be
>>>>>>> better than reading with one connection from Oracle.
>>>>>>>
>>>>>>> Surely among 404 columns there must be one with high cardinality to
>>>>>>> satisfy this work.
>>>>>>>
>>>>>>> Maybe you should just create a table as select * from
>>>>>>> Oracle_table where rownum <= 100; and use that for testing.
>>>>>>>
>>>>>>> Other alternative is to use Oracle SQL Connecter for HDFS
>>>>>>> <https://docs.oracle.com/cd/E37231_01/doc.20/e36961/sqlch.htm#BDCUG125>that
>>>>>>> can do it for you. With 404 columns it is difficult to suggest any
>>>>>>> alternative. Is this a FACT table?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>>
>>>>>>> LinkedIn * 
>>>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>>>
>>>>>>>
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>>
>>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>>>>>>> for any loss, damage or destruction of data or any other property which 
>>>>>>> may
>>>>>>> arise from relying on this email's technical content is explicitly
>>>>>>> disclaimed. The author will in no case be liable for any monetary 
>>>>>>> damages
>>>>>>> arising from such loss, damage or destruction.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 10 September 2016 at 16:20, Ajay Chander <itsche...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Everyone,
>>>>>>>>
>>>>>>>> My goal is to use Spark Sql to load huge amount of data from Oracle
>>>>>>>> to HDFS.
>>>>>>>>
>>>>>>>> *Table in Oracle:*
>>>>>>>> 1) no primary key.
>>>>>>>> 2) Has 404 columns.
>>>>>>>> 3) Has 200,800,000 rows.
>>>>>>>>
>>>>>>>> *Spark SQL:*
>>>>>>>> In my Spark SQL I want to read the data into n number of partitions
>>>>>>>> in parallel, for which I need to provide 'partition 
>>>>>>>> column','lowerBound',
>>>>>>>> 'upperbound', 'numPartitions' from the table Oracle. My table in 
>>>>>>>> Oracle has
>>>>>>>> no such column to satisfy this need(Highly Skewed), because of it, if 
>>>>>>>> the
>>>>>>>> numPartitions is set to 104, 102 tasks are finished in a minute, 1 task
>>>>>>>> finishes in 20 mins and the last one takes forever.
>>>>>>>>
>>>>>>>> Is there anything I could do to distribute the data evenly into
>>>>>>>> partitions? Can we set any fake query to orchestrate this pull 
>>>>>>>> process, as
>>>>>>>> we do in SQOOP like this '--boundary-query "SELECT CAST(0 AS NUMBER) AS
>>>>>>>> MIN_MOD_VAL, CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"' ?
>>>>>>>>
>>>>>>>> Any pointers are appreciated.
>>>>>>>>
>>>>>>>> Thanks for your time.
>>>>>>>>
>>>>>>>> ~ Ajay
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>
>>>>>
>>>
>>
>
>


Spark_JDBC_Partitions

2016-09-10 Thread Ajay Chander
Hello Everyone,

My goal is to use Spark SQL to load a huge amount of data from Oracle to HDFS.

*Table in Oracle:*
1) no primary key.
2) Has 404 columns.
3) Has 200,800,000 rows.

*Spark SQL:*
In my Spark SQL I want to read the data into n partitions in
parallel, for which I need to provide 'partitionColumn', 'lowerBound',
'upperBound', 'numPartitions' for the Oracle table. My table in Oracle has
no such column to satisfy this need (it is highly skewed); because of it, if
numPartitions is set to 104, 102 tasks finish in a minute, 1 task
finishes in 20 mins and the last one takes forever.

Is there anything I could do to distribute the data evenly into partitions?
Can we set any fake query to orchestrate this pull process, as we do in
SQOOP like this '--boundary-query "SELECT CAST(0 AS NUMBER) AS MIN_MOD_VAL,
CAST(12 AS NUMBER) AS MAX_MOD_VAL FROM DUAL"' ?

Any pointers are appreciated.

Thanks for your time.

~ Ajay
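
One approach not spelled out in the replies: the JDBC reader also has an overload that takes an explicit array of predicates, one per partition, so a synthetic bucket such as MOD(ORA_HASH(ROWID), n) can stand in for a real partition column. A rough sketch follows; the URL, credentials and table name are placeholders, and the ORA_HASH(ROWID) trick is an assumption about what the Oracle side will accept.

import java.util.Properties

import org.apache.spark.sql.{DataFrame, SQLContext}

object OraclePartitionedRead {
  // Sketch only: each predicate becomes one partition, i.e. one parallel Oracle session.
  def readBigTable(sqlContext: SQLContext): DataFrame = {
    val url = "jdbc:oracle:thin:@//oraclehost:1521/service"   // placeholder
    val props = new Properties()
    props.setProperty("user", "scott")                        // placeholder
    props.setProperty("password", "tiger")                    // placeholder
    props.setProperty("driver", "oracle.jdbc.OracleDriver")

    val numBuckets = 104
    val predicates = (0 until numBuckets)
      .map(b => s"MOD(ORA_HASH(ROWID), $numBuckets) = $b")
      .toArray

    sqlContext.read.jdbc(url, "BIG_SCHEMA.BIG_TABLE", predicates, props)
  }
}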


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-22 Thread Ajay Chander
Thanks for the confirmation Mich!

On Wednesday, June 22, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Ajay,
>
> I am afraid that for now transaction heartbeats do not work through Spark, so I
> have no other solution.
>
> This is an interesting point, as with Hive running on the Spark engine there is no
> issue with this, since Hive handles the transactions.
>
> I gather that, in its simplest form, Hive has to deal with its metadata for
> the transaction logic but Spark somehow cannot do that.
>
> In short that is it. You need to do that through Hive.
>
> Cheers,
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 June 2016 at 16:08, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Right now I have a similar usecase where I have to delete some rows
>> from a hive table. My hive table is of type ORC, Bucketed and included
>> transactional property. I can delete from hive shell but not from my
>> spark-shell or spark app. Were you able to find any work around? Thank
>> you.
>>
>> Regards,
>> Ajay
>>
>>
>> On Thursday, June 2, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> thanks for that.
>>>
>>> I will have a look
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 2 June 2016 at 10:46, Elliot West <tea...@gmail.com> wrote:
>>>
>>>> Related to this, there exists an API in Hive to simplify the
>>>> integrations of other frameworks with Hive's ACID feature:
>>>>
>>>> See:
>>>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API
>>>>
>>>> It contains code for maintaining heartbeats, handling locks and
>>>> transactions, and submitting mutations in a distributed environment.
>>>>
>>>> We have used it to write to transactional tables from Cascading based
>>>> processes.
>>>>
>>>> Elliot.
>>>>
>>>>
>>>> On 2 June 2016 at 09:54, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Spark does not support transactions because, as I understand it, there is
>>>>> a piece in the execution side that needs to send heartbeats to the Hive
>>>>> metastore saying "a transaction is still alive". That has not been
>>>>> implemented in Spark yet, to my knowledge.
>>>>>
>>>>> Any idea on the timelines when we are going to have support for
>>>>> transactions in Spark for Hive ORC tables. This will really be useful.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>
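
Since the delete has to go through Hive rather than Spark SQL for now, one pragmatic workaround, an assumption on my part rather than something confirmed in this thread, is to fire the DELETE at HiveServer2 over plain JDBC from the Spark driver and keep Spark for the read side. A rough sketch; the host, port, database, table and predicate are all placeholders.

import java.sql.DriverManager

object HiveAcidDeleteSketch {
  def main(args: Array[String]): Unit = {
    // Sketch only: issue the ACID DELETE through HiveServer2, not through HiveContext.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "ajay", "")
    try {
      val stmt = conn.createStatement()
      // Only works on a transactional (ORC, bucketed) table on a cluster with ACID enabled.
      stmt.execute("DELETE FROM my_db.my_orc_table WHERE cyc_dt = '2016-06-01'")
      stmt.close()
    } finally {
      conn.close()
    }
  }
}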


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-22 Thread Ajay Chander
Hi Mich,

Right now I have a similar use case where I have to delete some rows from a
Hive table. My Hive table is ORC, bucketed, and has the transactional
property set. I can delete from the Hive shell but not from my
spark-shell or Spark app. Were you able to find any workaround? Thank you.

Regards,
Ajay

On Thursday, June 2, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> thanks for that.
>
> I will have a look
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 2 June 2016 at 10:46, Elliot West <tea...@gmail.com> wrote:
>
>> Related to this, there exists an API in Hive to simplify the integrations
>> of other frameworks with Hive's ACID feature:
>>
>> See:
>> https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API
>>
>> It contains code for maintaining heartbeats, handling locks and
>> transactions, and submitting mutations in a distributed environment.
>>
>> We have used it to write to transactional tables from Cascading based
>> processes.
>>
>> Elliot.
>>
>>
>> On 2 June 2016 at 09:54, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> Spark does not support transactions because, as I understand it, there is a
>>> piece in the execution side that needs to send heartbeats to the Hive metastore
>>> saying "a transaction is still alive". That has not been implemented in
>>> Spark yet, to my knowledge.
>>>
>>> Any idea on the timelines when we are going to have support for
>>> transactions in Spark for Hive ORC tables. This will really be useful.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>
>>
>


SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-13 Thread Ajay Chander
Hi Mohit,

Thanks for your time. Please find my response below.

Did you try the same with another database?
I do load the data from MySQL and SQL Server the same way (through Spark SQL
JDBC), which works perfectly fine.

As a workaround you can write the select statement yourself instead of just
providing the table name?
Yes, I did that too. It did not make any difference.

Thank you,
Ajay

On Sunday, June 12, 2016, Mohit Jaggi <mohitja...@gmail.com> wrote:

> Looks like a bug in the code generating the SQL query…why would it be
> specific to SAS, I can’t guess. Did you try the same with another database?
> As a workaround you can write the select statement yourself instead of just
> providing the table name.
>
> On Jun 11, 2016, at 6:27 PM, Ajay Chander <itsche...@gmail.com> wrote:
>
> I tried implementing the same functionality through Scala as well. But no
> luck so far. Just wondering if anyone here tried using Spark SQL to read
> SAS dataset? Thank you
>
> Regards,
> Ajay
>
> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Mich, I completely agree with you. I built another Spark SQL application
>> which reads data from MySQL and SQL server and writes the data into
>> Hive(parquet+snappy format). I have this problem only when I read directly
>> from remote SAS system. The interesting part is I am using same driver to
>> read data through pure Java app and spark app. It works fine in Java
>> app, so I cannot blame SAS driver here. Trying to understand where the
>> problem could be. Thanks for sharing this with me.
>>
>> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> I personally use Scala to do something similar. For example here I
>>> extract data from an Oracle table and store in ORC table in Hive. This is
>>> compiled via sbt as run with SparkSubmit.
>>>
>>> It is similar to your code but in Scala. Note that I do not enclose my
>>> column names in double quotes.
>>>
>>> import org.apache.spark.SparkContext
>>> import org.apache.spark.SparkConf
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.hive.HiveContext
>>> import org.apache.spark.sql.types._
>>> import org.apache.spark.sql.SQLContext
>>> import org.apache.spark.sql.functions._
>>>
>>> object ETL_scratchpad_dummy {
>>>   def main(args: Array[String]) {
>>>   val conf = new SparkConf().
>>>setAppName("ETL_scratchpad_dummy").
>>>set("spark.driver.allowMultipleContexts", "true")
>>>   val sc = new SparkContext(conf)
>>>   // Create sqlContext based on HiveContext
>>>   val sqlContext = new HiveContext(sc)
>>>   import sqlContext.implicits._
>>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>   println ("\nStarted at"); sqlContext.sql("SELECT
>>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
>>> ").collect.foreach(println)
>>>   HiveContext.sql("use oraclehadoop")
>>>   var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>>>   var _username : String = "scratchpad"
>>>   var _password : String = ""
>>>
>>>   // Get data from Oracle table scratchpad.dummy
>>>   val d = HiveContext.load("jdbc",
>>>   Map("url" -> _ORACLEserver,
>>>   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
>>> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
>>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>>   "user" -> _username,
>>>   "password" -> _password))
>>>
>>>d.registerTempTable("tmp")
>>>   //
>>>   // Need to create and populate target ORC table oraclehadoop.dummy
>>>   //
>>>   HiveContext.sql("use oraclehadoop")
>>>   //
>>>   // Drop and create table dummy
>>>   //
>>>   HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>>>   var sqltext : String = ""
>>>   sqltext = """
>>>   CREATE TABLE oraclehadoop.dummy (
>>>  ID INT
>>>, CLUSTERED INT
>>>, SCATTERED INT
>>>, RANDOMISED INT
>>>, RANDOM_STRING VARCHAR(50)
>>>, SMALL_VC VARCHAR(10)
>>>, PADDING  VARCHAR(10)
>>>   )
>>>   CLUSTERED B

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-11 Thread Ajay Chander
I tried implementing the same functionality through Scala as well. But no
luck so far. Just wondering if anyone here tried using Spark SQL to read
SAS dataset? Thank you

Regards,
Ajay

On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:

> Mich, I completely agree with you. I built another Spark SQL application
> which reads data from MySQL and SQL server and writes the data into
> Hive(parquet+snappy format). I have this problem only when I read directly
> from remote SAS system. The interesting part is I am using same driver to
> read data through pure Java app and spark app. It works fine in Java app,
> so I cannot blame SAS driver here. Trying to understand where the problem
> could be. Thanks for sharing this with me.
>
> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> I personally use Scala to do something similar. For example here I
>> extract data from an Oracle table and store in ORC table in Hive. This is
>> compiled via sbt as run with SparkSubmit.
>>
>> It is similar to your code but in Scala. Note that I do not enclose my
>> column names in double quotes.
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.SparkConf
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.functions._
>>
>> object ETL_scratchpad_dummy {
>>   def main(args: Array[String]) {
>>   val conf = new SparkConf().
>>setAppName("ETL_scratchpad_dummy").
>>set("spark.driver.allowMultipleContexts", "true")
>>   val sc = new SparkContext(conf)
>>   // Create sqlContext based on HiveContext
>>   val sqlContext = new HiveContext(sc)
>>   import sqlContext.implicits._
>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   println ("\nStarted at"); sqlContext.sql("SELECT
>> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
>> ").collect.foreach(println)
>>   HiveContext.sql("use oraclehadoop")
>>   var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>>   var _username : String = "scratchpad"
>>   var _password : String = ""
>>
>>   // Get data from Oracle table scratchpad.dummy
>>   val d = HiveContext.load("jdbc",
>>   Map("url" -> _ORACLEserver,
>>   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
>> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>>   "user" -> _username,
>>   "password" -> _password))
>>
>>d.registerTempTable("tmp")
>>   //
>>   // Need to create and populate target ORC table oraclehadoop.dummy
>>   //
>>   HiveContext.sql("use oraclehadoop")
>>   //
>>   // Drop and create table dummy
>>   //
>>   HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>>   var sqltext : String = ""
>>   sqltext = """
>>   CREATE TABLE oraclehadoop.dummy (
>>  ID INT
>>, CLUSTERED INT
>>, SCATTERED INT
>>, RANDOMISED INT
>>, RANDOM_STRING VARCHAR(50)
>>, SMALL_VC VARCHAR(10)
>>, PADDING  VARCHAR(10)
>>   )
>>   CLUSTERED BY (ID) INTO 256 BUCKETS
>>   STORED AS ORC
>>   TBLPROPERTIES (
>>   "orc.create.index"="true",
>>   "orc.bloom.filter.columns"="ID",
>>   "orc.bloom.filter.fpp"="0.05",
>>   "orc.compress"="SNAPPY",
>>   "orc.stripe.size"="16777216",
>>   "orc.row.index.stride"="1" )
>>   """
>>HiveContext.sql(sqltext)
>>   //
>>   // Put data in Hive table. Clean up is already done
>>   //
>>   sqltext = """
>>   INSERT INTO TABLE oraclehadoop.dummy
>>   SELECT
>>   ID
>> , CLUSTERED
>> , SCATTERED
>> , RANDOMISED
>> , RANDOM_STRING
>> , SMALL_VC
>> , PADDING
>>   FROM tmp
>>   """
>>HiveContext.sql(sqltext)
>>   println ("\nFinished at"); sqlContext.sql("SELECT
>> FROM_unixtime(unix_timestamp

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Mich, I completely agree with you. I built another Spark SQL application
which reads data from MySQL and SQL Server and writes the data into
Hive (parquet+snappy format). I have this problem only when I read directly
from the remote SAS system. The interesting part is that I am using the same
driver to read data through the pure Java app and the Spark app. It works
fine in the Java app, so I cannot blame the SAS driver here. Trying to
understand where the problem could be. Thanks for sharing this with me.

On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I personally use Scala to do something similar. For example here I extract
> data from an Oracle table and store in ORC table in Hive. This is compiled
> via sbt as run with SparkSubmit.
>
> It is similar to your code but in Scala. Note that I do not enclose my
> column names in double quotes.
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
>
> object ETL_scratchpad_dummy {
>   def main(args: Array[String]) {
>   val conf = new SparkConf().
>setAppName("ETL_scratchpad_dummy").
>set("spark.driver.allowMultipleContexts", "true")
>   val sc = new SparkContext(conf)
>   // Create sqlContext based on HiveContext
>   val sqlContext = new HiveContext(sc)
>   import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   println ("\nStarted at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>   HiveContext.sql("use oraclehadoop")
>   var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>   var _username : String = "scratchpad"
>   var _password : String = ""
>
>   // Get data from Oracle table scratchpad.dummy
>   val d = HiveContext.load("jdbc",
>   Map("url" -> _ORACLEserver,
>   "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS
> CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS
> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>   "user" -> _username,
>   "password" -> _password))
>
>d.registerTempTable("tmp")
>   //
>   // Need to create and populate target ORC table oraclehadoop.dummy
>   //
>   HiveContext.sql("use oraclehadoop")
>   //
>   // Drop and create table dummy
>   //
>   HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>   var sqltext : String = ""
>   sqltext = """
>   CREATE TABLE oraclehadoop.dummy (
>  ID INT
>, CLUSTERED INT
>, SCATTERED INT
>, RANDOMISED INT
>, RANDOM_STRING VARCHAR(50)
>, SMALL_VC VARCHAR(10)
>, PADDING  VARCHAR(10)
>   )
>   CLUSTERED BY (ID) INTO 256 BUCKETS
>   STORED AS ORC
>   TBLPROPERTIES (
>   "orc.create.index"="true",
>   "orc.bloom.filter.columns"="ID",
>   "orc.bloom.filter.fpp"="0.05",
>   "orc.compress"="SNAPPY",
>   "orc.stripe.size"="16777216",
>   "orc.row.index.stride"="1" )
>   """
>HiveContext.sql(sqltext)
>   //
>   // Put data in Hive table. Clean up is already done
>   //
>   sqltext = """
>   INSERT INTO TABLE oraclehadoop.dummy
>   SELECT
>   ID
> , CLUSTERED
>     , SCATTERED
> , RANDOMISED
> , RANDOM_STRING
> , SMALL_VC
> , PADDING
>   FROM tmp
>   """
>HiveContext.sql(sqltext)
>   println ("\nFinished at"); sqlContext.sql("SELECT
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss')
> ").collect.foreach(println)
>   sys.exit()
>  }
> }
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Thanks for the response. If you look at my programs, I am not writings my
>> queries to include column names in a pair of ""

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi Mich,

Thanks for the response. If you look at my programs, I am not writing my
queries to include column names in a pair of "". The driver in my Spark
program is generating such a query with column names in "", which I do not
want. On the other hand, I am using the same driver in my pure Java program
(which is attached), and in that program the same driver is generating a
proper SQL query without "".

Pure Java log:

[2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)

Spark SQL log:

[2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)

[2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
Please find complete program and full logs attached in the below thread.
Thank you.

Regards,
Ajay

On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Assuming I understood your query, in Spark SQL (that is you log in to
> spark sql like  spark-sql --master spark://:7077 you do not
> need double quotes around column names for sql to work
>
> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
> hello from Mich
>
> Anything between a pair of "" will be interpreted as text NOT column name.
>
> In Spark SQL you do not need double quotes. So simply
>
> spark-sql> select prod_id, cust_id from sales limit 2;
> 17  28017
> 18  10419
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Hi again, anyone in this group tried to access SAS dataset through Spark
>> SQL ? Thank you
>>
>> Regards,
>> Ajay
>>
>>
>> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>>
>>> Hi Spark Users,
>>>
>>> I hope everyone here are doing great.
>>>
>>> I am trying to read data from SAS through Spark SQL and write into HDFS.
>>> Initially, I started with pure java program please find the program and
>>> logs in the attached file sas_pure_java.txt . My program ran
>>> successfully and it returned the data from Sas to Spark_SQL. Please
>>> note the highlighted part in the log.
>>>
>>> My SAS dataset has 4 rows,
>>>
>>> Program ran successfully. So my output is,
>>>
>>> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
>>> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result
>>> set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>>>
>>> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next);
>>> time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>>>
>>> 1,'2016-01-01','2016-01-31'
>>>
>>> 2,'2016-02-01','2016-02-29'
>>>
>>> 3,'2016-03-01','2016-03-31'
>>>
>>> 4,'2016-04-01','2016-04-30'
>>>
>>>
>>> Please find the full logs attached to this email in file
>>> sas_pure_java.txt.
>>>
>>> ___
>>>
>>>
>>> Now I am trying to do the same via Spark SQL. Please find my program
>>> and logs attached to this email in file sas_spark_sql.txt .
>>>
>>> Connection to SAS dataset is established successfully. But please note
>>> the highlighted log below.
>>>
>>> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared
>>> statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>>>
>>> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
>>> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result
>>> set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>>> Please find the full logs attached to this email in file
>>>  sas_spark_sql.txt
>>>
>>> I am using same driver in both pure java and spark sql programs. But the
>>> query generated in spark sql has quotes around the column names(Highlighted
>>> above).
>>> So my resulting output for that query is like this,
>>>
>>> +-++--+
>>> |  _c0| _c1|   _c2|
>>> +-++--+
>>> |SR_NO|start_dt|end_dt|
>>> |SR_NO|start_dt|end_dt|
>>> |SR_NO|start_dt|end_dt|
>>> |SR_NO|start_dt|end_dt|
>>> +-++--+
>>>
>>> Since both programs are using the same driver com.sas.rio.MVADriver .
>>> Expected output should be same as my pure java programs output. But
>>> something else is happening behind the scenes.
>>>
>>> Any insights on this issue. Thanks for your time.
>>>
>>>
>>> Regards,
>>>
>>> Ajay
>>>
>>
>


Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi again, has anyone in this group tried to access a SAS dataset through Spark
SQL? Thank you

Regards,
Ajay

On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Spark Users,
>
> I hope everyone here are doing great.
>
> I am trying to read data from SAS through Spark SQL and write into HDFS.
> Initially, I started with pure java program please find the program and
> logs in the attached file sas_pure_java.txt . My program ran successfully
> and it returned the data from Sas to Spark_SQL. Please note the
> highlighted part in the log.
>
> My SAS dataset has 4 rows,
>
> Program ran successfully. So my output is,
>
> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
> a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
> 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>
> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time=
> 0.045 secs (com.sas.rio.MVAResultSet:773)
>
> 1,'2016-01-01','2016-01-31'
>
> 2,'2016-02-01','2016-02-29'
>
> 3,'2016-03-01','2016-03-31'
>
> 4,'2016-04-01','2016-04-30'
>
>
> Please find the full logs attached to this email in file sas_pure_java.txt.
>
> ___
>
>
> Now I am trying to do the same via Spark SQL. Please find my program and
> logs attached to this email in file sas_spark_sql.txt .
>
> Connection to SAS dataset is established successfully. But please note
> the highlighted log below.
>
> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
> 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>
> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
> "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
> 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
> Please find the full logs attached to this email in file
>  sas_spark_sql.txt
>
> I am using same driver in both pure java and spark sql programs. But the
> query generated in spark sql has quotes around the column names(Highlighted
> above).
> So my resulting output for that query is like this,
>
> +-++--+
> |  _c0| _c1|   _c2|
> +-++--+
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> +-++--+
>
> Since both programs are using the same driver com.sas.rio.MVADriver .
> Expected output should be same as my pure java programs output. But
> something else is happening behind the scenes.
>
> Any insights on this issue. Thanks for your time.
>
>
> Regards,
>
> Ajay
>


SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Ajay Chander
Hi Spark Users,

I hope everyone here is doing great.

I am trying to read data from SAS through Spark SQL and write it into HDFS.
Initially, I started with a pure Java program; please find the program and
logs in the attached file sas_pure_java.txt. My program ran successfully
and it returned the data from SAS to Spark SQL. Please note the highlighted
part in the log.

My SAS dataset has 4 rows,

Program ran successfully. So my output is,

[2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT
a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set
1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)

[2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time=
0.045 secs (com.sas.rio.MVAResultSet:773)

1,'2016-01-01','2016-01-31'

2,'2016-02-01','2016-02-29'

3,'2016-03-01','2016-03-31'

4,'2016-04-01','2016-04-30'


Please find the full logs attached to this email in file sas_pure_java.txt.

___


Now I am trying to do the same via Spark SQL. Please find my program and
logs attached to this email in file sas_spark_sql.txt .

Connection to SAS dataset is established successfully. But please note the
highlighted log below.

[2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT
"SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement
2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)

[2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT
"SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set
2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
Please find the full logs attached to this email in file sas_spark_sql.txt

I am using the same driver in both the pure Java and Spark SQL programs, but the
query generated in Spark SQL has quotes around the column names (highlighted
above).
So my resulting output for that query is like this:

+-++--+
|  _c0| _c1|   _c2|
+-++--+
|SR_NO|start_dt|end_dt|
|SR_NO|start_dt|end_dt|
|SR_NO|start_dt|end_dt|
|SR_NO|start_dt|end_dt|
+-++--+

Since both programs are using the same driver, com.sas.rio.MVADriver, the
expected output should be the same as my pure Java program's output, but
something else is happening behind the scenes.

Any insights on this issue? Thanks for your time.


Regards,

Ajay
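
The double quotes come from Spark's JDBC dialect layer, which wraps each column name with quoteIdentifier when it builds the SELECT it sends through the driver. If your Spark build exposes the org.apache.spark.sql.jdbc developer API, one possible workaround, untested against SAS/SHARE and therefore only an assumption, is to register a dialect for sasiom URLs that returns identifiers unquoted, in Scala:

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Sketch only: stop Spark from double-quoting column names for jdbc:sasiom connections.
object SasDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:sasiom")

  // The default dialect wraps names in double quotes, which SAS appears to treat as
  // string literals; return the bare name instead.
  override def quoteIdentifier(colName: String): String = colName
}

// Register before building the DataFrame:
JdbcDialects.registerDialect(SasDialect)
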
Spark Code to read SAS dataset
-


package com.test.sas.connections;

import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;


public class SparkSasConnectionTest {

    public static void main(String args[]) throws SQLException, ClassNotFoundException {

        SparkConf sc = new SparkConf().setAppName("SASSparkJdbcTest").setMaster("local");
        @SuppressWarnings("resource")
        JavaSparkContext jsc = new JavaSparkContext(sc);
        HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc.sc());

        Properties props = new Properties();
        props.setProperty("user", "Ajay");
        props.setProperty("password", "Ajay");
        props.setProperty("librefs", "sasLib '/export/home/Ajay'");
        props.setProperty("usesspi", "none");
        props.setProperty("encryptionPolicy", "required");
        props.setProperty("encryptionAlgorithms", "AES");

        Class.forName("com.sas.rio.MVADriver");

        Map options = new HashMap();
        options.put("driver", "com.sas.rio.MVADriver");

        DataFrame jdbcDF = hiveContext.read().jdbc(
            "jdbc:sasiom://remote1.system.com:8594", "sasLib.run_control", props);
        jdbcDF.show();
    }

}





Spark Log
-

[2016-06-10 10:28:26,812] INFO Running Spark version 1.5.0 (org.apache.spark.SparkContext:59)
[2016-06-10 10:28:27,024] WARN Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (org.apache.hadoop.util.NativeCodeLoader:62)
[2016-06-10 10:28:27,588] INFO Changing view acls to: Ajay (org.apache.spark.SecurityManager:59)
[2016-06-10 10:28:27,589] INFO Changing modify acls to: Ajay (org.apache.spark.SecurityManager:59)
[2016-06-10 10:28:27,589] INFO SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Ajay); users with modify permissions: Set(AE10302) (org.apache.spark.SecurityManager:59)
[2016-06-10 10:28:28,012] INFO Slf4jLogger started (akka.event.slf4j.Slf4jLogger:80)
[2016-06-10 10:28:28,039] INFO Starting remoting (Remoting:74)
[2016-06-10 10:28:28,153] INFO Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.9:49499] (Remoting:74)
[2016-06-10 10:28:28,158] INFO Successfully started service

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Deepak, thanks for the info. I was thinking of reading both the source and
destination tables into separate RDDs/DataFrames, then applying some specific
transformations to find the updated records, removing the updated keyed rows from
the destination, and appending the updated records to the destination. Any pointers on
this kind of usage?
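
To make it concrete, this is the rough shape I have in mind (only a sketch; the key
column `id`, the JDBC settings and the paths below are placeholders, and it assumes a
sqlContext is already in scope):

// New/changed rows pulled from MySQL (placeholder connection settings)
val source = sqlContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:mysql://dbhost:3306/mydb",
  "dbtable"  -> "src_table",
  "user"     -> "user",
  "password" -> "pass")).load()

// Current destination table on HDFS (placeholder path and format)
val dest = sqlContext.read.parquet("/data/dest_table")

// Keep only the destination rows whose key does not appear in the new extract...
val keys = source.select("id")
val unchanged = dest.join(keys, dest("id") === keys("id"), "left_outer")
  .where(keys("id").isNull)
  .select(dest.columns.map(dest(_)): _*)

// ...then append the incoming rows to get the updated destination
unchanged.unionAll(source).write.mode("overwrite").parquet("/data/dest_table_new")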

It would be great if you could provide an example of what you mentioned
below. Thanks much.

Regards,
Aj

On Tuesday, June 7, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:

> I am not sure if Spark provides any support for incremental extracts
> inherently.
> But you can maintain a file e.g. extractRange.conf in hdfs , to read from
> it the end range and update it with new end range from  spark job before it
> finishes with the new relevant ranges to be used next time.
>
> On Tue, Jun 7, 2016 at 8:49 PM, Ajay Chander <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
>> Hi Mich, thanks for your inputs. I used sqoop to get the data from MySQL.
>> Now I am using spark to do the same. Right now, I am trying
>> to implement incremental updates while loading from MySQL through spark.
>> Can you suggest any best practices for this ? Thank you.
>>
>>
>> On Tuesday, June 7, 2016, Mich Talebzadeh <mich.talebza...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','mich.talebza...@gmail.com');>> wrote:
>>
>>> I use Spark rather that Sqoop to import data from an Oracle table into a
>>> Hive ORC table.
>>>
>>> It used JDBC for this purpose. All inclusive in Scala itself.
>>>
>>> Also Hive runs on Spark engine. Order of magnitude faster with Inde on
>>> map-reduce/.
>>>
>>> pretty simple.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 7 June 2016 at 15:38, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> bq. load the data from edge node to hdfs
>>>>
>>>> Does the loading involve accessing sqlserver ?
>>>>
>>>> Please take a look at
>>>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>>>
>>>> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni <mmistr...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi
>>>>> how about
>>>>>
>>>>> 1.  have a process that read the data from your sqlserver and dumps it
>>>>> as a file into a directory on your hd
>>>>> 2. use spark-streanming to read data from that directory  and store it
>>>>> into hdfs
>>>>>
>>>>> perhaps there is some sort of spark 'connectors' that allows you to
>>>>> read data from a db directly so you dont need to go via spk streaming?
>>>>>
>>>>>
>>>>> hth
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander <itsche...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Spark users,
>>>>>>
>>>>>> Right now we are using spark for everything(loading the data from
>>>>>> sqlserver, apply transformations, save it as permanent tables in
>>>>>> hive) in our environment. Everything is being done in one spark 
>>>>>> application.
>>>>>>
>>>>>> The only thing we do before we launch our spark application through
>>>>>> oozie is, to load the data from edge node to hdfs(it is being triggered
>>>>>> through a ssh action from oozie to run shell script on edge node).
>>>>>>
>>>>>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>>>>>> through spark ? So that everything is done in one spark DAG and lineage
>>>>>> graph?
>>>>>>
>>>>>> Any pointers are highly appreciated. Thanks
>>>>>>
>>>>>> Regards,
>>>>>> Aj
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Mich, thanks for your inputs. I used sqoop to get the data from MySQL.
Now I am using spark to do the same. Right now, I am trying
to implement incremental updates while loading from MySQL through spark.
Can you suggest any best practices for this ? Thank you.


On Tuesday, June 7, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I use Spark rather that Sqoop to import data from an Oracle table into a
> Hive ORC table.
>
> It used JDBC for this purpose. All inclusive in Scala itself.
>
> Also Hive runs on Spark engine. Order of magnitude faster with Inde on
> map-reduce/.
>
> pretty simple.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 15:38, Ted Yu <yuzhih...@gmail.com
> <javascript:_e(%7B%7D,'cvml','yuzhih...@gmail.com');>> wrote:
>
>> bq. load the data from edge node to hdfs
>>
>> Does the loading involve accessing sqlserver ?
>>
>> Please take a look at
>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni <mmistr...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','mmistr...@gmail.com');>> wrote:
>>
>>> Hi
>>> how about
>>>
>>> 1.  have a process that read the data from your sqlserver and dumps it
>>> as a file into a directory on your hd
>>> 2. use spark-streanming to read data from that directory  and store it
>>> into hdfs
>>>
>>> perhaps there is some sort of spark 'connectors' that allows you to read
>>> data from a db directly so you dont need to go via spk streaming?
>>>
>>>
>>> hth
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander <itsche...@gmail.com
>>> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>>>
>>>> Hi Spark users,
>>>>
>>>> Right now we are using spark for everything(loading the data from
>>>> sqlserver, apply transformations, save it as permanent tables in
>>>> hive) in our environment. Everything is being done in one spark 
>>>> application.
>>>>
>>>> The only thing we do before we launch our spark application through
>>>> oozie is, to load the data from edge node to hdfs(it is being triggered
>>>> through a ssh action from oozie to run shell script on edge node).
>>>>
>>>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>>>> through spark ? So that everything is done in one spark DAG and lineage
>>>> graph?
>>>>
>>>> Any pointers are highly appreciated. Thanks
>>>>
>>>> Regards,
>>>> Aj
>>>>
>>>
>>>
>>
>


Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Marco, Ted, thanks for your time. I am sorry if I wasn't clear enough. We
have two sources,

1) sql server
2) files are pushed onto edge node by upstreams on a daily basis.

Point 1 has been achieved by using JDBC format in spark sql.

Point 2 has been achieved by using shell script.

My only concern is point 2: whether there is any way I can do it in
my spark app instead of a shell script.

Thanks.

On Tuesday, June 7, 2016, Ted Yu <yuzhih...@gmail.com> wrote:

> bq. load the data from edge node to hdfs
>
> Does the loading involve accessing sqlserver ?
>
> Please take a look at
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni <mmistr...@gmail.com
> <javascript:_e(%7B%7D,'cvml','mmistr...@gmail.com');>> wrote:
>
>> Hi
>> how about
>>
>> 1.  have a process that read the data from your sqlserver and dumps it as
>> a file into a directory on your hd
>> 2. use spark-streanming to read data from that directory  and store it
>> into hdfs
>>
>> perhaps there is some sort of spark 'connectors' that allows you to read
>> data from a db directly so you dont need to go via spk streaming?
>>
>>
>> hth
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander <itsche...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>>
>>> Hi Spark users,
>>>
>>> Right now we are using spark for everything(loading the data from
>>> sqlserver, apply transformations, save it as permanent tables in
>>> hive) in our environment. Everything is being done in one spark application.
>>>
>>> The only thing we do before we launch our spark application through
>>> oozie is, to load the data from edge node to hdfs(it is being triggered
>>> through a ssh action from oozie to run shell script on edge node).
>>>
>>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>>> through spark ? So that everything is done in one spark DAG and lineage
>>> graph?
>>>
>>> Any pointers are highly appreciated. Thanks
>>>
>>> Regards,
>>> Aj
>>>
>>
>>
>


Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Spark users,

Right now we are using spark for everything(loading the data from
sqlserver, apply transformations, save it as permanent tables in hive) in
our environment. Everything is being done in one spark application.

The only thing we do before we launch our spark application through
oozie is, to load the data from edge node to hdfs(it is being triggered
through a ssh action from oozie to run shell script on edge node).

My question is,  there's any way we can accomplish edge-to-hdfs copy
through spark ? So that everything is done in one spark DAG and lineage
graph?

Any pointers are highly appreciated. Thanks

Regards,
Aj


Re: how to get file name of record being reading in spark

2016-05-31 Thread Ajay Chander
Hi Vikash,

These are my thoughts: read the input directory using wholeTextFiles(),
which gives a paired RDD with the file name as the key and the file
content as the value. Then you can apply a map function to each line and append the
key (the file name) to the record.
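
Something along these lines (a sketch; the paths are placeholders, and note that the
key returned by wholeTextFiles is the full file path, so you may want to strip the
directory prefix):

// (filePath, fileContent) pairs for every file in the input directory
val files = sc.wholeTextFiles("hdfs:///input/files")

// Append the file name to every record of every file
val withFileName = files.flatMap { case (filePath, content) =>
  val fileName = filePath.substring(filePath.lastIndexOf('/') + 1)
  content.split("\n").map(line => s"$line,$fileName")
}

withFileName.saveAsTextFile("hdfs:///output/files_with_name")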

Thank you,
Aj

On Tuesday, May 31, 2016, Vikash Kumar  wrote:

> I have a requirement in which I need to read the input files from a
> directory and append the file name in each record while output.
>
> e.g. I have directory /input/files/ which have folllowing files:
> ABC_input_0528.txt
> ABC_input_0531.txt
>
> suppose input file ABC_input_0528.txt contains
> 111,abc,234
> 222,xyz,456
>
> suppose input file ABC_input_0531.txt contains
> 100,abc,299
> 200,xyz,499
>
> and I need to create one final output with file name in each record using
> dataframes
> my output file should looks like this:
> 111,abc,234,ABC_input_0528.txt
> 222,xyz,456,ABC_input_0528.txt
> 100,abc,299,ABC_input_0531.txt
> 200,xyz,499,ABC_input_0531.txt
>
> I am trying to use this inputFileName function but it is showing blank.
>
> https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#inputFileName()
>
> Can anybody help me?
>
>


Re: Spark_API_Copy_From_Edgenode

2016-05-28 Thread Ajay Chander
Hi Everyone, Any insights on this thread? Thank you.

On Friday, May 27, 2016, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Everyone,
>
>I have some data located on the EdgeNode. Right
> now, the process I follow to copy the data from Edgenode to HDFS is through
> a shellscript which resides on Edgenode. In Oozie I am using a SSH action
> to execute the shell script on Edgenode which copies the data to HDFS.
>
>   I was just wondering if there is any built in
> API with in Spark to do this job. I want to read the data from Edgenode
> into RDD using JavaSparkContext then do saveAsTextFile("hdfs://...").
> JavaSparkContext  does provide any method to pass Edgenode's access
> credentials and read the data into an RDD ??
>
> Thank you for your valuable time. Any pointers are appreciated.
>
> Thank You,
> Aj
>


Spark_API_Copy_From_Edgenode

2016-05-27 Thread Ajay Chander
Hi Everyone,

   I have some data located on the EdgeNode. Right
now, the process I follow to copy the data from Edgenode to HDFS is through
a shellscript which resides on Edgenode. In Oozie I am using a SSH action
to execute the shell script on Edgenode which copies the data to HDFS.

  I was just wondering if there is any built in API
with in Spark to do this job. I want to read the data from Edgenode into
RDD using JavaSparkContext then do saveAsTextFile("hdfs://...").
Does JavaSparkContext provide any method to pass the Edgenode's access
credentials and read the data into an RDD?
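
The closest thing I can think of so far is to do the copy from the driver with the
Hadoop FileSystem API before the rest of the job runs, rather than through an RDD
(just a sketch; the paths are placeholders, and it assumes the driver itself runs on
the Edgenode so the local files are visible to it):

import org.apache.hadoop.fs.{FileSystem, Path}

// sc is the already-created SparkContext (jsc.sc() from a JavaSparkContext)
val fs = FileSystem.get(sc.hadoopConfiguration)

// Copy the edge-node files into HDFS from the driver, then carry on in Spark
fs.copyFromLocalFile(new Path("/data/on/edgenode"), new Path("hdfs:///user/aj/landing"))

sc.textFile("hdfs:///user/aj/landing").saveAsTextFile("hdfs:///user/aj/output")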

Thank you for your valuable time. Any pointers are appreciated.

Thank You,
Aj


Re: Hive_context

2016-05-24 Thread Ajay Chander
Hi Arun,

Thanks for your time. I was able to connect through a JDBC Java client, but I
am not able to connect from my Spark application. Do you think I missed any
configuration step in the code? Somehow my application is not picking
up hive-site.xml from my machine, even though I put it on the class
path at ${SPARK_HOME}/conf/. It would be really helpful if anyone has any
sort of example in either Java or Scala. Thank you.
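
The sort of minimal example I am after would look roughly like this (a sketch; the
thrift URI is a placeholder for the real metastore address):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContextTest {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveContextTest").setMaster("local"))
    val hiveContext = new HiveContext(sc)

    // Point at the remote metastore explicitly instead of relying on hive-site.xml
    // being picked up from the classpath (placeholder host and port)
    hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")

    hiveContext.sql("show tables").show()
  }
}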

On Monday, May 23, 2016, Arun Natva <arun.na...@gmail.com> wrote:

> Can you try a hive JDBC java client from eclipse and query a hive table
> successfully ?
>
> This way we can narrow down where the issue is ?
>
>
> Sent from my iPhone
>
> On May 23, 2016, at 5:26 PM, Ajay Chander <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
> I downloaded the spark 1.5 untilities and exported SPARK_HOME pointing to
> it. I copied all the cluster configuration files(hive-site.xml,
> hdfs-site.xml etc files) inside the ${SPARK_HOME}/conf/ . My application
> looks like below,
>
>
> public class SparkSqlTest {
>
> public static void main(String[] args) {
>
>
> SparkConf sc = new SparkConf().setAppName("SQL_Test").setMaster("local");
>
> JavaSparkContext jsc = new JavaSparkContext(sc);
>
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc
> .sc());
>
> DataFrame sampleDataFrame = hiveContext.sql("show tables");
>
> sampleDataFrame.show();
>
>
> }
>
> }
>
>
> I am expecting my application to return all the tables from the default
> database. But somehow it returns empty list. I am just wondering if I need
> to add anything to my code to point it to hive metastore. Thanks for your
> time. Any pointers are appreciated.
>
>
> Regards,
>
> Aj
>
>
> On Monday, May 23, 2016, Ajay Chander <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
>> Hi Everyone,
>>
>> I am building a Java Spark application in eclipse IDE. From my
>> application I want to use hiveContext to read tables from the remote
>> Hive(Hadoop cluster). On my machine I have exported $HADOOP_CONF_DIR =
>> {$HOME}/hadoop/conf/. This path has all the remote cluster conf details
>> like hive-site.xml, hdfs-site.xml ... Somehow I am not able to communicate
>> to remote cluster from my app. Is there any additional configuration work
>> that I am supposed to do to get it work? I specified master as 'local' in
>> the code. Thank you.
>>
>> Regards,
>> Aj
>>
>


Re: Hive_context

2016-05-23 Thread Ajay Chander
I downloaded the Spark 1.5 utilities and exported SPARK_HOME pointing to
them. I copied all the cluster configuration files (hive-site.xml,
hdfs-site.xml, etc.) into ${SPARK_HOME}/conf/. My application
looks like below:


public class SparkSqlTest {

    public static void main(String[] args) {

        SparkConf sc = new SparkConf().setAppName("SQL_Test").setMaster("local");
        JavaSparkContext jsc = new JavaSparkContext(sc);
        HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc.sc());

        // Expecting the tables of the default database; currently this comes back empty
        DataFrame sampleDataFrame = hiveContext.sql("show tables");
        sampleDataFrame.show();
    }
}


I am expecting my application to return all the tables from the default
database, but somehow it returns an empty list. I am just wondering if I need
to add anything to my code to point it to the Hive metastore. Thanks for your
time. Any pointers are appreciated.


Regards,

Aj


On Monday, May 23, 2016, Ajay Chander <itsche...@gmail.com> wrote:

> Hi Everyone,
>
> I am building a Java Spark application in eclipse IDE. From my application
> I want to use hiveContext to read tables from the remote Hive(Hadoop
> cluster). On my machine I have exported $HADOOP_CONF_DIR =
> {$HOME}/hadoop/conf/. This path has all the remote cluster conf details
> like hive-site.xml, hdfs-site.xml ... Somehow I am not able to communicate
> to remote cluster from my app. Is there any additional configuration work
> that I am supposed to do to get it work? I specified master as 'local' in
> the code. Thank you.
>
> Regards,
> Aj
>


Hive_context

2016-05-23 Thread Ajay Chander
Hi Everyone,

I am building a Java Spark application in the Eclipse IDE. From my application
I want to use hiveContext to read tables from the remote Hive (Hadoop
cluster). On my machine I have exported $HADOOP_CONF_DIR =
{$HOME}/hadoop/conf/. This path has all the remote cluster conf details
like hive-site.xml, hdfs-site.xml, etc. Somehow I am not able to communicate
with the remote cluster from my app. Is there any additional configuration work
that I am supposed to do to get it to work? I specified the master as 'local' in
the code. Thank you.

Regards,
Aj


Re: Cluster Migration

2016-05-10 Thread Ajay Chander
Never mind! I figured it out by saving it as a Hadoop file and passing the
compression codec to it. Thank you!
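
For the archive, the essence of it was along these lines (a sketch; the paths are
placeholders and GzipCodec is just one codec choice; saveAsTextFile with a codec
delegates to saveAsHadoopFile underneath):

import org.apache.hadoop.io.compress.GzipCodec

// (fileName, fileContent) pairs for everything under temp1
val files = sc.wholeTextFiles("hdfs:///temp1")

// Write the contents back out compressed; each output part file comes out gzipped
files.map { case (fileName, content) => s"$fileName\n$content" }
     .saveAsTextFile("hdfs:///temp2", classOf[GzipCodec])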

On Tuesday, May 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:

> Hi, I have a folder temp1 in hdfs which have multiple format files
> test1.txt, test2.avsc (Avro file) in it. Now I want to compress these files
> together and store it under temp2 folder in hdfs. Expecting that temp2
> folder will have one file test_compress.gz which has test1.txt and
> test2.avsc under it. Is there any possible/effiencient way to achieve this?
>
> Thanks,
> Aj
>
> On Tuesday, May 10, 2016, Ajay Chander <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
>> I will try that out. Thank you!
>>
>> On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>
>>> Yes that's what I intended to say.
>>>
>>> Thanks
>>> Deepak
>>> On 10 May 2016 11:47 pm, "Ajay Chander" <itsche...@gmail.com> wrote:
>>>
>>>> Hi Deepak,
>>>>Thanks for your response. If I am correct, you suggest reading
>>>> all of those files into an rdd on the cluster using wholeTextFiles then
>>>> apply compression codec on it, save the rdd to another Hadoop cluster?
>>>>
>>>> Thank you,
>>>> Ajay
>>>>
>>>> On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>>
>>>>> Hi Ajay
>>>>> You can look at wholeTextFiles method of rdd[string,string] and then
>>>>> map each of rdd  to saveAsTextFile .
>>>>> This will serve the purpose .
>>>>> I don't think if anything default like distcp exists in spark
>>>>>
>>>>> Thanks
>>>>> Deepak
>>>>> On 10 May 2016 11:27 pm, "Ajay Chander" <itsche...@gmail.com> wrote:
>>>>>
>>>>>> Hi Everyone,
>>>>>>
>>>>>> we are planning to migrate the data between 2 clusters and I see
>>>>>> distcp doesn't support data compression. Is there any efficient way to
>>>>>> compress the data during the migration ? Can I implement any spark job to
>>>>>> do this ? Thanks.
>>>>>>
>>>>>


Re: Cluster Migration

2016-05-10 Thread Ajay Chander
Hi, I have a folder temp1 in HDFS which has files of multiple formats,
test1.txt and test2.avsc (an Avro file), in it. Now I want to compress these files
together and store them under a temp2 folder in HDFS, expecting that the temp2
folder will have one file, test_compress.gz, which has test1.txt and
test2.avsc under it. Is there any possible/efficient way to achieve this?

Thanks,
Aj

On Tuesday, May 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:

> I will try that out. Thank you!
>
> On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com
> <javascript:_e(%7B%7D,'cvml','deepakmc...@gmail.com');>> wrote:
>
>> Yes that's what I intended to say.
>>
>> Thanks
>> Deepak
>> On 10 May 2016 11:47 pm, "Ajay Chander" <itsche...@gmail.com> wrote:
>>
>>> Hi Deepak,
>>>Thanks for your response. If I am correct, you suggest reading
>>> all of those files into an rdd on the cluster using wholeTextFiles then
>>> apply compression codec on it, save the rdd to another Hadoop cluster?
>>>
>>> Thank you,
>>> Ajay
>>>
>>> On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:
>>>
>>>> Hi Ajay
>>>> You can look at wholeTextFiles method of rdd[string,string] and then
>>>> map each of rdd  to saveAsTextFile .
>>>> This will serve the purpose .
>>>> I don't think if anything default like distcp exists in spark
>>>>
>>>> Thanks
>>>> Deepak
>>>> On 10 May 2016 11:27 pm, "Ajay Chander" <itsche...@gmail.com> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> we are planning to migrate the data between 2 clusters and I see
>>>>> distcp doesn't support data compression. Is there any efficient way to
>>>>> compress the data during the migration ? Can I implement any spark job to
>>>>> do this ? Thanks.
>>>>>
>>>>


Re: Cluster Migration

2016-05-10 Thread Ajay Chander
Hi Deepak,
   Thanks for your response. If I am correct, you suggest reading all
of those files into an RDD on the cluster using wholeTextFiles, then applying a
compression codec to it and saving the RDD to another Hadoop cluster?

Thank you,
Ajay

On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:

> Hi Ajay
> You can look at wholeTextFiles method of rdd[string,string] and then map
> each of rdd  to saveAsTextFile .
> This will serve the purpose .
> I don't think if anything default like distcp exists in spark
>
> Thanks
> Deepak
> On 10 May 2016 11:27 pm, "Ajay Chander" <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
>> Hi Everyone,
>>
>> we are planning to migrate the data between 2 clusters and I see distcp
>> doesn't support data compression. Is there any efficient way to compress
>> the data during the migration ? Can I implement any spark job to do this ?
>>  Thanks.
>>
>


Re: Cluster Migration

2016-05-10 Thread Ajay Chander
I will try that out. Thank you!

On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com> wrote:

> Yes that's what I intended to say.
>
> Thanks
> Deepak
> On 10 May 2016 11:47 pm, "Ajay Chander" <itsche...@gmail.com
> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>
>> Hi Deepak,
>>Thanks for your response. If I am correct, you suggest reading all
>> of those files into an rdd on the cluster using wholeTextFiles then apply
>> compression codec on it, save the rdd to another Hadoop cluster?
>>
>> Thank you,
>> Ajay
>>
>> On Tuesday, May 10, 2016, Deepak Sharma <deepakmc...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','deepakmc...@gmail.com');>> wrote:
>>
>>> Hi Ajay
>>> You can look at wholeTextFiles method of rdd[string,string] and then map
>>> each of rdd  to saveAsTextFile .
>>> This will serve the purpose .
>>> I don't think if anything default like distcp exists in spark
>>>
>>> Thanks
>>> Deepak
>>> On 10 May 2016 11:27 pm, "Ajay Chander" <itsche...@gmail.com> wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> we are planning to migrate the data between 2 clusters and I see distcp
>>>> doesn't support data compression. Is there any efficient way to compress
>>>> the data during the migration ? Can I implement any spark job to do this ?
>>>>  Thanks.
>>>>
>>>


Cluster Migration

2016-05-10 Thread Ajay Chander
Hi Everyone,

we are planning to migrate the data between 2 clusters and I see distcp
doesn't support data compression. Is there any efficient way to compress
the data during the migration ? Can I implement any spark job to do this ?
 Thanks.


Re: Converting a string of format of 'dd/MM/yyyy' in Spark sql

2016-03-24 Thread Ajay Chander
Mich,

Can you try changing the value of paymentdate to this
format, paymentdate='2015-01-01 23:59:59', so that to_date(paymentdate) can parse it, and see if
it helps.
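
If the column has to stay in dd/MM/yyyy, another option (a sketch, assuming Spark 1.5+
where unix_timestamp accepts a format string) is to parse it explicitly:

sql("""select paymentdate,
       to_date(from_unixtime(unix_timestamp(paymentdate, 'dd/MM/yyyy'))) as payment_dt
       from tmp""").first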

On Thursday, March 24, 2016, Tamas Szuromi 
wrote:

> Hi Mich,
>
> Take a look
> https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/functions.html#unix_timestamp(org.apache.spark.sql.Column,%20java.lang.String)
>
> cheers,
> Tamas
>
>
> On 24 March 2016 at 14:29, Mich Talebzadeh  > wrote:
>
>>
>> Hi,
>>
>> I am trying to convert a date in Spark temporary table
>>
>> Tried few approaches.
>>
>> scala> sql("select paymentdate, to_date(paymentdate) from tmp")
>> res21: org.apache.spark.sql.DataFrame = [paymentdate: string, _c1: date]
>>
>>
>> scala> sql("select paymentdate, to_date(paymentdate) from tmp").first
>> *res22: org.apache.spark.sql.Row = [10/02/2014,null]*
>>
>> My date is stored as String dd/MM/ as shown above. However, to_date()
>> returns null!
>>
>>
>> Thanks
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


Re: Problem mixing MESOS Cluster Mode and Docker task execution

2016-03-10 Thread Ajay Chander
Hi Everyone, a quick question within this context: what is the underlying
persistent storage that you guys are using with regard to this
containerized environment? Thanks

On Thursday, March 10, 2016, yanlin wang  wrote:

> How you guys make driver docker within container to be reachable from
> spark worker ?
>
> Would you share your driver docker? i am trying to put only driver in
> docker and spark running with yarn outside of container and i don’t want to
> use —net=host
>
> Thx
> Yanlin
>
> On Mar 10, 2016, at 11:06 AM, Guillaume Eynard Bontemps <
> g.eynard.bonte...@gmail.com
> > wrote:
>
> Glad to hear it. Thanks all  for sharing your  solutions.
>
> Le jeu. 10 mars 2016 19:19, Eran Chinthaka Withana <
> eran.chinth...@gmail.com
> > a écrit :
>
>> Phew, it worked. All I had to do was to add *export
>> SPARK_JAVA_OPTS="-Dspark.mesos.executor.docker.image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6"
>> *before calling spark-submit. Guillaume, thanks for the pointer.
>>
>> Timothy, thanks for looking into this. Looking forward to see a fix soon.
>>
>> Thanks,
>> Eran Chinthaka Withana
>>
>> On Thu, Mar 10, 2016 at 10:10 AM, Tim Chen > > wrote:
>>
>>> Hi Eran,
>>>
>>> I need to investigate but perhaps that's true, we're using
>>> SPARK_JAVA_OPTS to pass all the options and not --conf.
>>>
>>> I'll take a look at the bug, but if you can try the workaround and see
>>> if that fixes your problem.
>>>
>>> Tim
>>>
>>> On Thu, Mar 10, 2016 at 10:08 AM, Eran Chinthaka Withana <
>>> eran.chinth...@gmail.com
>>> > wrote:
>>>
 Hi Timothy

 What version of spark are you guys running?
>

 I'm using Spark 1.6.0. You can see the Dockerfile I used here:
 https://github.com/echinthaka/spark-mesos-docker/blob/master/docker/mesos-spark/Dockerfile



> And also did you set the working dir in your image to be spark home?
>

 Yes I did. You can see it here: https://goo.gl/8PxtV8

 Can it be because of this:
 https://issues.apache.org/jira/browse/SPARK-13258 as Guillaume pointed
 out above? As you can see, I'm passing in the docker image URI through
 spark-submit (--conf spark.mesos.executor.docker.
 image=echinthaka/mesos-spark:0.23.1-1.6.0-2.6)

 Thanks,
 Eran



>>>
>>
>


Re: Facing issue with floor function in spark SQL query

2016-03-04 Thread Ajay Chander
Hi Ashok,

Try using HiveContext instead of SQLContext. I suspect SQLContext does not
have that functionality. Let me know if it works.
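
Roughly like this (a sketch; the JSON path is a placeholder, and 300000 ms is one
choice for a 5-minute bucket):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)

val logdf = hiveContext.read.json("my-json.gz")
logdf.registerTempTable("logs")

// floor(...) resolves through the HiveQL parser; 300000 ms = 5 minutes
val bucketLogs = hiveContext.sql(
  "SELECT `user.timestamp` AS rawTimeStamp, `user.requestId` AS requestId, " +
  "floor(`user.timestamp`/300000) AS timeBucket FROM logs")
bucketLogs.show()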

Thanks,
Ajay

On Friday, March 4, 2016, ashokkumar rajendran <
ashokkumar.rajend...@gmail.com> wrote:

> Hi Ayan,
>
> Thanks for the response. I am using SQL query (not Dataframe). Could you
> please explain how I should import this sql function to it? Simply
> importing this class to my driver code does not help here.
>
> Many functions that I need are already there in the sql.functions so I do
> not want to rewrite them.
>
> Regards
> Ashok
>
> On Fri, Mar 4, 2016 at 3:52 PM, ayan guha  > wrote:
>
>> Most likely you are missing import of  org.apache.spark.sql.functions.
>>
>> In any case, you can write your own function for floor and use it as UDF.
>>
>> On Fri, Mar 4, 2016 at 7:34 PM, ashokkumar rajendran <
>> ashokkumar.rajend...@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> I load json file that has timestamp (as long in milliseconds) and
>>> several other attributes. I would like to group them by 5 minutes and store
>>> them as separate file.
>>>
>>> I am facing couple of problems here..
>>> 1. Using Floor function at select clause (to bucket by 5mins) gives me
>>> error saying "java.util.NoSuchElementException: key not found: floor". How
>>> do I use floor function in select clause? I see that floor method is
>>> available in org.apache.spark.sql.functions clause but not sure why its not
>>> working here.
>>> 2. Can I use the same in Group by clause?
>>> 3. How do I store them as separate file after grouping them?
>>>
>>> String logPath = "my-json.gz";
>>> DataFrame logdf = sqlContext.read().json(logPath);
>>> logdf.registerTempTable("logs");
>>> DataFrame bucketLogs = sqlContext.sql("Select `user.timestamp`
>>> as rawTimeStamp, `user.requestId` as requestId,
>>> *floor(`user.timestamp`/72000*) as timeBucket FROM logs");
>>> bucketLogs.toJSON().saveAsTextFile("target_file");
>>>
>>> Regards
>>> Ashok
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Ajay Chander
Thanks for your kind inputs. Right now I am running spark-1.3.1 on YARN (a
4-node cluster) on a HortonWorks distribution. Now I want to upgrade
spark-1.3.1 to
spark-1.5.1. So at this point, do I have to manually go and copy the
spark-1.5.1 tarball to all the nodes, or is there an alternative so that I
can get it upgraded through the Ambari UI? If possible, can anyone point me to
documentation online? Thank you.

Regards,
Ajay

On Wednesday, October 21, 2015, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> Hi Frans,
>
> You could download Spark 1.5.1-hadoop 2.6 pre-built tarball and copy into
> HDP 2.3 sandbox or master node. Then copy all the conf files from
> /usr/hdp/current/spark-client/ to your /conf, or you could
> refer to this tech preview (
> http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/
> ), in "installing chapter", step 4 ~ 8 is what you need to do.
>
> Thanks
> Saisai
>
> On Wed, Oct 21, 2015 at 1:27 PM, Frans Thamura <fr...@meruvian.org
> <javascript:_e(%7B%7D,'cvml','fr...@meruvian.org');>> wrote:
>
>> Doug
>>
>> is it possible to put in HDP 2.3?
>>
>> esp in Sandbox
>>
>> can share how do you install it?
>>
>>
>> F
>> --
>> Frans Thamura (曽志胜)
>> Java Champion
>> Shadow Master and Lead Investor
>> Meruvian.
>> Integrated Hypermedia Java Solution Provider.
>>
>> Mobile: +628557888699
>> Blog: http://blogs.mervpolis.com/roller/flatburger (id)
>>
>> FB: http://www.facebook.com/meruvian
>> TW: http://www.twitter.com/meruvian / @meruvian
>> Website: http://www.meruvian.org
>>
>> "We grow because we share the same belief."
>>
>>
>> On Wed, Oct 21, 2015 at 12:24 PM, Doug Balog <doug.sparku...@dugos.com
>> <javascript:_e(%7B%7D,'cvml','doug.sparku...@dugos.com');>> wrote:
>> > I have been running 1.5.1 with Hive in secure mode on HDP 2.2.4 without
>> any problems.
>> >
>> > Doug
>> >
>> >> On Oct 21, 2015, at 12:05 AM, Ajay Chander <itsche...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');>> wrote:
>> >>
>> >> Hi Everyone,
>> >>
>> >> Any one has any idea if spark-1.5.1 is available as a service on
>> HortonWorks ? I have spark-1.3.1 installed on the Cluster and it is a
>> HortonWorks distribution. Now I want upgrade it to spark-1.5.1. Anyone here
>> have any idea about it? Thank you in advance.
>> >>
>> >> Regards,
>> >> Ajay
>> >
>> >
>>
>>
>


Spark_sql

2015-10-21 Thread Ajay Chander
Hi Everyone,

I have a use case where I have to create a DataFrame inside the map()
function. To create a DataFrame it needs a sqlContext or hiveContext. Now how
do I pass the context to my map function? And I am doing it in Java. I
tried creating a class "TestClass" which implements "Function",
and inside the call method I want to create the DataFrame, so I created a
parameterized constructor to pass the context from the driver program to TestClass
and use that context to create the DataFrame. But it seems like this is the wrong
way of doing it. Can anyone help me with this? Thanks in advance.

Regards,
Aj


Spark_1.5.1_on_HortonWorks

2015-10-21 Thread Ajay Chander
Hi Saisai,

Thanks for your time. I have followed your inputs and downloaded
"spark-1.5.1-bin-hadoop2.6" on one of the nodes, say node1. And when I ran a
Pi test everything seemed to be working fine, except that the spark-history-server
running on node1 has gone down. It was complaining about a missing class:

15/10/21 16:41:28 INFO HistoryServer: Registered signal handlers for [TERM,
HUP, INT]
15/10/21 16:41:28 WARN SparkConf: The configuration key
'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark
1.3 and and may be removed in the future. Please use the new key
'spark.yarn.am.waitTime' instead.
15/10/21 16:41:29 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
15/10/21 16:41:29 INFO SecurityManager: Changing view acls to: root
15/10/21 16:41:29 INFO SecurityManager: Changing modify acls to: root
15/10/21 16:41:29 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
Exception in thread "main" java.lang.ClassNotFoundException:
org.apache.spark.deploy.yarn.history.YarnHistoryProvider
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
at
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:231)
at
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)


I went to the lib folder and noticed that
"spark-assembly-1.5.1-hadoop2.6.0.jar" is missing that class. I was able to
get the Spark history server started with 1.3.1, but not with 1.5.1. Any inputs
on this?
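
The only lead I have so far: that provider class appears to ship only with the HDP
build of Spark, so the HDP spark-defaults.conf presumably points spark.history.provider
at it. A sketch of the change I am considering for the stock 1.5.1 tarball (the log
directory is a placeholder):

# conf/spark-defaults.conf used by the history server started from the 1.5.1 tarball
spark.history.provider         org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory  hdfs:///spark-history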

Really appreciate your help. Thanks

Regards,
Ajay



On Wednesday, October 21, 2015, Saisai Shao <sai.sai.s...@gmail.com
<javascript:_e(%7B%7D,'cvml','sai.sai.s...@gmail.com');>> wrote:

> Hi Ajay,
>
> You don't need to copy tarball to all the nodes, only one node you want to
> run spark application is enough (mostly the master node), Yarn will help to
> distribute the Spark dependencies. The link I mentioned before is the one
> you could follow, please read my previous mail.
>
> Thanks
> Saisai
>
>
>
> On Thu, Oct 22, 2015 at 1:56 AM, Ajay Chander <itsche...@gmail.com> wrote:
>
>> Thanks for your kind inputs. Right now I am running spark-1.3.1 on YARN(4
>> node cluster) on a HortonWorks distribution. Now I want to upgrade 
>> spark-1.3.1 to
>> spark-1.5.1. So at this point of time, do I have to manually go and copy
>> spark-1.5.1 tarbal to all the nodes or is there any alternative so that I
>> can get it upgraded through Ambari UI ? If possible can anyone point me to
>> a documentation online? Thank you.
>>
>> Regards,
>> Ajay
>>
>>
>> On Wednesday, October 21, 2015, Saisai Shao <sai.sai.s...@gmail.com>
>> wrote:
>>
>>> Hi Frans,
>>>
>>> You could download Spark 1.5.1-hadoop 2.6 pre-built tarball and copy
>>> into HDP 2.3 sandbox or master node. Then copy all the conf files from
>>> /usr/hdp/current/spark-client/ to your /conf, or you could
>>> refer to this tech preview (
>>> http://hortonworks.com/hadoop-tutorial/apache-spark-1-4-1-technical-preview-with-hdp/
>>> ), in "installing chapter", step 4 ~ 8 is what you need to do.
>>>
>>> Thanks
>>> Saisai
>>>
>>> On Wed, Oct 21, 2015 at 1:27 PM, Frans Thamura <fr...@meruvian.org>
>>> wrote:
>>>
>>>> Doug
>>>>
>>>> is it possible to put in HDP 2.3?
>>>>
>>>> esp in Sandbox
>>>>
>>>> can share how do you install it?
>>>>
>>>>
>>>> F
>>>> --
>>>> Frans Thamura (曽志胜)
>>>> Java Champion
>>>> Shadow Master and Lead Investor
>>>> Meruvian.
>>>> Integrated Hypermedia Java Solution Provider.
>>>>
>>>> Mobile: +628557888699
>>>> Blog: http://blogs.mervpolis.com/roller/flatburger (id)
>>>>
>>>> FB: http://www.facebook.com/meruvian
>>>> TW: http://www.twitter.com/meruvian / @meruvian
>>>> Website: http://www.meruvian.org
>>>>
>>>> "We grow because we share the same belief."
>>>>

Spark_1.5.1_on_HortonWorks

2015-10-20 Thread Ajay Chander
Hi Everyone,

Does anyone have any idea if spark-1.5.1 is available as a service on
HortonWorks? I have spark-1.3.1 installed on the cluster and it is a
HortonWorks distribution. Now I want to upgrade it to spark-1.5.1. Does anyone here
have any idea about it? Thank you in advance.

Regards,
Ajay


Re: saveAsTextFile creates an empty folder in HDFS

2015-10-03 Thread Ajay Chander
Hi Jacinto,

If I were you, the first thing I would do is write a sample Java
application that writes data into HDFS and see if it works fine. The metadata
is being created in HDFS; that means communication with the namenode is working
fine, but not with the datanodes, since you don't see any data inside the file. Why
don't you check the HDFS logs and see what's happening when your application is
talking to the namenode? I suspect some networking issue, or check whether the
datanodes are running fine.
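
Something as small as this would do for the sanity check (a sketch; the namenode URI
and the output path are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsWriteCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://node1.i3a.info:8020")   // placeholder namenode URI

    val fs = FileSystem.get(conf)
    val out = fs.create(new Path("/user/jarias/hdfs_write_check.txt"))
    out.write("hello hdfs\n".getBytes("UTF-8"))   // if no bytes land here, the problem is below Spark
    out.close()
    fs.close()
  }
}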

Thank you,
Ajay

On Saturday, October 3, 2015, Jacinto Arias  wrote:

> Yes printing the result with collect or take is working,
>
> actually this is a minimal example, but also when working with real data
> the actions are performed, and the resulting RDDs can be printed out
> without problem. The data is there and the operations are correct, they
> just cannot be written to a file.
>
>
> On 03 Oct 2015, at 16:17, Ted Yu  > wrote:
>
> bq.  val dist = sc.parallelize(l)
>
> Following the above, can you call, e.g. count() on dist before saving ?
>
> Cheers
>
> On Fri, Oct 2, 2015 at 1:21 AM, jarias  > wrote:
>
>> Dear list,
>>
>> I'm experimenting a problem when trying to write any RDD to HDFS. I've
>> tried
>> with minimal examples, scala programs and pyspark programs both in local
>> and
>> cluster modes and as standalone applications or shells.
>>
>> My problem is that when invoking the write command, a task is executed but
>> it just creates an empty folder in the given HDFS path. I'm lost at this
>> point because there is no sign of error or warning in the spark logs.
>>
>> I'm running a seven node cluster managed by cdh5.7, spark 1.3. HDFS is
>> working properly when using the command tools or running MapReduce jobs.
>>
>>
>> Thank you for your time, I'm not sure if this is just a rookie mistake or
>> an
>> overall config problem.
>>
>> Just a working example:
>>
>> This sequence produces the following log and creates the empty folder
>> "test":
>>
>> scala> val l = Seq.fill(1)(nextInt)
>> scala> val dist = sc.parallelize(l)
>> scala> dist.saveAsTextFile("hdfs://node1.i3a.info/user/jarias/test/")
>>
>>
>> 15/10/02 10:19:22 INFO FileOutputCommitter: File Output Committer
>> Algorithm
>> version is 1
>> 15/10/02 10:19:22 INFO SparkContext: Starting job: saveAsTextFile at
>> :27
>> 15/10/02 10:19:22 INFO DAGScheduler: Got job 3 (saveAsTextFile at
>> :27) with 2 output partitions (allowLocal=false)
>> 15/10/02 10:19:22 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile
>> at
>> :27)
>> 15/10/02 10:19:22 INFO DAGScheduler: Parents of final stage: List()
>> 15/10/02 10:19:22 INFO DAGScheduler: Missing parents: List()
>> 15/10/02 10:19:22 INFO DAGScheduler: Submitting Stage 3
>> (MapPartitionsRDD[7]
>> at saveAsTextFile at :27), which has no missing parents
>> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(137336) called with
>> curMem=184615, maxMem=278302556
>> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3 stored as values in
>> memory (estimated size 134.1 KB, free 265.1 MB)
>> 15/10/02 10:19:22 INFO MemoryStore: ensureFreeSpace(47711) called with
>> curMem=321951, maxMem=278302556
>> 15/10/02 10:19:22 INFO MemoryStore: Block broadcast_3_piece0 stored as
>> bytes
>> in memory (estimated size 46.6 KB, free 265.1 MB)
>> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in
>> memory
>> on nodo1.i3a.info:36330 (size: 46.6 KB, free: 265.3 MB)
>> 15/10/02 10:19:22 INFO BlockManagerMaster: Updated info of block
>> broadcast_3_piece0
>> 15/10/02 10:19:22 INFO SparkContext: Created broadcast 3 from broadcast at
>> DAGScheduler.scala:839
>> 15/10/02 10:19:22 INFO DAGScheduler: Submitting 2 missing tasks from
>> Stage 3
>> (MapPartitionsRDD[7] at saveAsTextFile at :27)
>> 15/10/02 10:19:22 INFO YarnScheduler: Adding task set 3.0 with 2 tasks
>> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
>> 6, nodo2.i3a.info, PROCESS_LOCAL, 25975 bytes)
>> 15/10/02 10:19:22 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID
>> 7, nodo3.i3a.info, PROCESS_LOCAL, 25963 bytes)
>> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in
>> memory
>> on nodo2.i3a.info:37759 (size: 46.6 KB, free: 530.2 MB)
>> 15/10/02 10:19:22 INFO BlockManagerInfo: Added broadcast_3_piece0 in
>> memory
>> on nodo3.i3a.info:54798 (size: 46.6 KB, free: 530.2 MB)
>> 15/10/02 10:19:22 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID
>> 6) in 312 ms on nodo2.i3a.info (1/2)
>> 15/10/02 10:19:23 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID
>> 7) in 313 ms on nodo3.i3a.info (2/2)
>> 15/10/02 10:19:23 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks
>> have
>> all completed, from pool
>> 15/10/02 10:19:23 INFO DAGScheduler: Stage 3 (saveAsTextFile at
>> :27) finished in 0.334 s
>> 15/10/02 10:19:23 INFO DAGScheduler: Job 3 

Re: submit_spark_job_to_YARN

2015-08-30 Thread Ajay Chander
Hi David,

Thanks for responding! My main intention is to submit a Spark job/jar to the
YARN cluster from within my Eclipse code. Is there any way that I
could pass my YARN configuration somewhere in the code to submit the jar to
the cluster?
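
The closest thing I have found so far is the launcher API that ships with Spark 1.4+
(org.apache.spark.launcher.SparkLauncher); it still drives spark-submit under the
covers, but it can be called from code. A sketch (the Spark home, jar path and class
name are placeholders, and HADOOP_CONF_DIR/YARN_CONF_DIR still have to point at the
cluster's conf directory in the environment the launcher runs in):

import org.apache.spark.launcher.SparkLauncher

object SubmitWordCount {
  def main(args: Array[String]): Unit = {
    val app = new SparkLauncher()
      .setSparkHome("/opt/spark")                  // placeholder
      .setAppResource("/path/to/wordcount.jar")    // placeholder
      .setMainClass("com.example.WordCount")       // placeholder
      .setMaster("yarn-client")
      .setConf(SparkLauncher.DRIVER_MEMORY, "1g")
      .launch()                                    // returns a java.lang.Process

    app.waitFor()
  }
}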

Thank you,
Ajay

On Sunday, August 30, 2015, David Mitchell jdavidmitch...@gmail.com wrote:

 Hi Ajay,

 Are you trying to save to your local file system or to HDFS?

 // This would save to HDFS under /user/hadoop/counter
 counter.saveAsTextFile(/user/hadoop/counter);

 David


 On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander itsche...@gmail.com
 javascript:_e(%7B%7D,'cvml','itsche...@gmail.com'); wrote:

 Hi Everyone,

 Recently we have installed spark on yarn in hortonworks cluster. Now I am
 trying to run a wordcount program in my eclipse and I
 did setMaster(local) and I see the results that's as expected. Now I want
 to submit the same job to my yarn cluster from my eclipse. In storm
 basically I was doing the same by using StormSubmitter class and by passing
 nimbus  zookeeper host to Config object. I was looking for something
 exactly the same.

 When I went through the documentation online, it read that I am suppose
 to export HADOOP_HOME_DIR=path to the conf dir. So now I copied the conf
 folder from one of sparks gateway node to my local Unix box. Now I did
 export that dir...

 export HADOOP_HOME_DIR=/Users/user1/Documents/conf/

 And I did the same in .bash_profile too. Now when I do echo
 $HADOOP_HOME_DIR, I see the path getting printed in the command prompt. Now
 my assumption is, in my program when I change setMaster(local) to
 setMaster(yarn-client) my program should pick up the resource mangers i.e
 yarn cluster info from the directory which I have exported and the job
 should get submitted to resolve manager from my eclipse. But somehow it's
 not happening. Please tell me if my assumption is wrong or if I am missing
 anything here.

 I have attached the word count program that I was using. Any help is
 highly appreciated.

 Thank you,
 Ajay







 --
 ### Confidential e-mail, for recipient's (or recipients') eyes only, not
 for distribution. ###



submit_spark_job_to_YARN

2015-08-30 Thread Ajay Chander
Hi Everyone,

Recently we installed Spark on YARN in a HortonWorks cluster. Now I am
trying to run a wordcount program in my Eclipse, and with
setMaster(local) I see the expected results. Now I want
to submit the same job to my YARN cluster from my Eclipse. In Storm
I was basically doing the same thing by using the StormSubmitter class and by passing
the nimbus and zookeeper hosts to the Config object. I was looking for something
exactly like that.

When I went through the documentation online, it read that I am supposed to
export HADOOP_HOME_DIR=path to the conf dir. So I copied the conf
folder from one of Spark's gateway nodes to my local Unix box and exported
that dir:

export HADOOP_HOME_DIR=/Users/user1/Documents/conf/

And I did the same in .bash_profile too. Now when I do echo
$HADOOP_HOME_DIR, I see the path printed in the command prompt. My
assumption is that when I change setMaster(local) to
setMaster(yarn-client), my program should pick up the resource manager, i.e. the
YARN cluster info, from the directory which I have exported and the job
should get submitted to the resource manager from my Eclipse. But somehow it's
not happening. Please tell me if my assumption is wrong or if I am missing
anything here.

I have attached the word count program that I was using. Any help is highly
appreciated.

Thank you,
Ajay


submit_spark_job
Description: Binary data


Re: submit_spark_job_to_YARN

2015-08-30 Thread Ajay Chander
Thanks everyone for your valuable time and information. It was helpful.

On Sunday, August 30, 2015, Ted Yu yuzhih...@gmail.com wrote:

 This is related:
 SPARK-10288 Add a rest client for Spark on Yarn

 FYI

 On Sun, Aug 30, 2015 at 12:12 PM, Dawid Wysakowicz 
 wysakowicz.da...@gmail.com
 javascript:_e(%7B%7D,'cvml','wysakowicz.da...@gmail.com'); wrote:

 Hi Ajay,

 In short story: No, there is no easy way to do that. But if you'd like to
 play around this topic a good starting point would be this blog post from
 sequenceIQ: blog
 http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/.

 I heard rumors that there are some work going on to prepare Submit API,
 but I am not a contributor and I can't say neither if it is true nor how
 are the works going on.
 For now the suggested way is to use the provided script: spark-submit.

 Regards
 Dawid

 2015-08-30 20:54 GMT+02:00 Ajay Chander itsche...@gmail.com
 javascript:_e(%7B%7D,'cvml','itsche...@gmail.com');:

 Hi David,

 Thanks for responding! My main intention was to submit spark Job/jar to
 yarn cluster from my eclipse with in the code. Is there any way that I
 could pass my yarn configuration somewhere in the code to submit the jar to
 the cluster?

 Thank you,
 Ajay


 On Sunday, August 30, 2015, David Mitchell jdavidmitch...@gmail.com
 javascript:_e(%7B%7D,'cvml','jdavidmitch...@gmail.com'); wrote:

 Hi Ajay,

 Are you trying to save to your local file system or to HDFS?

 // This would save to HDFS under /user/hadoop/counter
 counter.saveAsTextFile(/user/hadoop/counter);

 David


 On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander itsche...@gmail.com
 wrote:

 Hi Everyone,

 Recently we have installed spark on yarn in hortonworks cluster. Now I
 am trying to run a wordcount program in my eclipse and I
 did setMaster(local) and I see the results that's as expected. Now I 
 want
 to submit the same job to my yarn cluster from my eclipse. In storm
 basically I was doing the same by using StormSubmitter class and by 
 passing
 nimbus  zookeeper host to Config object. I was looking for something
 exactly the same.

 When I went through the documentation online, it read that I am
 suppose to export HADOOP_HOME_DIR=path to the conf dir. So now I copied
 the conf folder from one of sparks gateway node to my local Unix box. Now 
 I
 did export that dir...

 export HADOOP_HOME_DIR=/Users/user1/Documents/conf/

 And I did the same in .bash_profile too. Now when I do echo
 $HADOOP_HOME_DIR, I see the path getting printed in the command prompt. 
 Now
 my assumption is, in my program when I change setMaster(local) to
 setMaster(yarn-client) my program should pick up the resource mangers 
 i.e
 yarn cluster info from the directory which I have exported and the job
 should get submitted to resolve manager from my eclipse. But somehow it's
 not happening. Please tell me if my assumption is wrong or if I am missing
 anything here.

 I have attached the word count program that I was using. Any help is
 highly appreciated.

 Thank you,
 Ajay







 --
 ### Confidential e-mail, for recipient's (or recipients') eyes only,
 not for distribution. ###