Spark-SQL 1.6.2 w/Hive UDF @Description

2016-12-23 Thread Lavelle, Shawn
Hello Spark Users,


I have a Hive UDF that I'm trying to use with Spark-SQL, and it's showing up a
bit awkwardly:


I can load it into the Hive Thrift Server with a "create function ..." query
against the Hive context, and I can then use the UDF in queries. However, a
"desc function <name>" says the function doesn't exist, even though the
function is loaded into the default database, i.e., it shows up as
default.<name> in a "desc functions" call.


Any thoughts as to why this is? Any workarounds?
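
For reference, a minimal Scala sketch of the scenario as described, against the
Spark 1.6 HiveContext API. The function name, class, and jar path below are
placeholders, not the actual UDF:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-udf-check"))
val hiveContext = new HiveContext(sc)

// Register a permanent Hive UDF through the Hive context (placeholder names).
hiveContext.sql(
  "CREATE FUNCTION my_udf AS 'com.example.udf.MyUdf' USING JAR 'hdfs:///udfs/my-udf.jar'")

// The UDF itself works in queries...
hiveContext.sql("SELECT my_udf(col1) FROM some_table").show()

// ...and these two catalog calls are where the reported inconsistency shows up:
hiveContext.sql("DESCRIBE FUNCTION my_udf").show()   // reported as not existing
hiveContext.sql("SHOW FUNCTIONS").show(1000, false)  // lists it as default.my_udf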


~ Shawn M Lavelle




Shawn Lavelle
Software Development

4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Fax: 763 551 0750
Email: shawn.lave...@osii.com
Website: www.osii.com



Re: Approach: Incremental data load from HBASE

2016-12-23 Thread Chetan Khatri
Ted, correct. In my case I want incremental import from HBase and
incremental load to Hive. Both approaches discussed earlier with indexing
seem accurate to me. But just as Sqoop supports incremental import and load
for an RDBMS, is there any tool which supports incremental import from HBase?



On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu  wrote:

> Incremental load traditionally means generating hfiles and
> using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
> data into hbase.
>
> For your use case, the producer needs to find rows where the flag is 0 or
> 1.
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Ok, sure, I will ask.
>>
>> But what would be a generic best-practice solution for incremental load
>> from HBase?
>>
>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu  wrote:
>>
>>> I haven't used Gobblin.
>>> You can consider asking the Gobblin mailing list about the first option.
>>>
>>> The second option would work.
>>>
>>>
>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
>>>> Hello Guys,
>>>>
>>>> I would like to understand the different approaches for distributed
>>>> incremental load from HBase. Is there any *tool / incubator tool* which
>>>> satisfies this requirement?
>>>>
>>>> *Approach 1:*
>>>>
>>>> Write a Kafka producer, manually maintain a column flag for events, and
>>>> ingest it with LinkedIn Gobblin to HDFS / S3.
>>>>
>>>> *Approach 2:*
>>>>
>>>> Run a scheduled Spark job - read from HBase, do transformations, and
>>>> maintain the flag column at the HBase level.
>>>>
>>>> In both approaches above, I need to maintain column-level flags, such as
>>>> 0 - default, 1 - sent, 2 - sent and acknowledged. So next time the
>>>> producer will take another batch of 1000 rows where the flag is 0 or 1.
>>>>
>>>> I am looking for a best-practice approach with any distributed tool.
>>>>
>>>> Thanks.
>>>>
>>>> - Chetan Khatri

>>>
>>>
>>
>
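
Since the thread keeps coming back to "Approach 2", here is a minimal Scala
sketch of that pattern, assuming hypothetical table and column names: a
scheduled Spark job scans HBase for rows whose flag is 0 or 1, processes them,
and flips the flag afterwards. The actual delivery to Hive or Kafka is elided.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put, Result}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("hbase-incremental-load"))

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")

// Read the table as an RDD of (row key, Result) pairs.
val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Keep only rows whose flag column is 0 or 1 (not yet acknowledged).
val pending = rows.filter { case (_, result) =>
  val flag = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("flag"))
  flag != null && Set("0", "1").contains(Bytes.toString(flag))
}

// Process each partition, then mark the rows as sent and acknowledged (flag = 2).
pending.foreachPartition { partition =>
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("events"))
  partition.foreach { case (rowKey, _) =>
    // ... transform the row and deliver it to Hive / Kafka here ...
    val put = new Put(rowKey.copyBytes())
    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("flag"), Bytes.toBytes("2"))
    table.put(put)
  }
  table.close()
  conn.close()
}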


Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
Correct, so there is the approach you suggested and the uber-jar approach. I
think the uber-jar approach is the better practice, because environment
migration would be easier, and performance-wise the uber-jar approach should
also be better optimised than the uber-less approach.

Thanks.
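
For concreteness, a minimal build.sbt sketch of the uber-jar approach discussed
in this thread, using sbt-assembly with Spark marked as "provided" so that only
application dependencies end up inside the fat jar. Library names and versions
below are illustrative:

// project/plugins.sbt would contain:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

name := "spark-hbase-job"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided",
  // application-only dependencies are bundled into the assembly jar
  "com.typesafe" % "config" % "1.3.1"
)

// Resolve duplicate files pulled in by transitive dependencies.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

The alternative described earlier in the thread keeps the application jar thin
and ships the libraries next to Spark's own jars (or via spark-submit --jars),
trading a heavier deployment step for a classpath that stays closer to the
development environment.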

On Fri, Dec 23, 2016 at 11:41 PM, Andy Dang  wrote:

> We remodel Spark dependencies and ours together and chuck them under the
> /jars path. There are other ways to do it but we want the classpath to be
> strictly as close to development as possible.
>
> ---
> Regards,
> Andy
>
> On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Andy, thanks for the reply.
>>
>> If we download all the dependencies to a separate location and link them
>> with the Spark job jar on the Spark cluster, is that the best way to
>> execute a Spark job?
>>
>> Thanks.
>>
>> On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang  wrote:
>>
>>> I used to use an uber jar in Spark 1.x because of classpath issues (we
>>> couldn't re-model our dependencies based on our code, and thus the cluster's
>>> runtime dependencies could be very different from running Spark directly in
>>> the IDE). We had to use the userClasspathFirst "hack" to work around this.
>>>
>>> With Spark 2, it's easier to replace dependencies (say, Guava) than
>>> before. We moved away from deploying superjar and just pass the libraries
>>> as part of Spark jars (still can't use Guava v19 or later because Spark
>>> uses a deprecated method that's not available, but that's not a big issue
>>> for us).
>>>
>>> ---
>>> Regards,
>>> Andy
>>>
>>> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
>>>> Hello Spark Community,
>>>>
>>>> For Spark job creation I use sbt-assembly to build an uber ("super") jar
>>>> and then submit it with spark-submit.
>>>>
>>>> Example,
>>>>
>>>> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
>>>> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>>>>
>>>> But other folks have argued for an uber-less jar. Can you please explain
>>>> the best-practice industry standard for this?
>>>>
>>>> Thanks,
>>>>
>>>> Chetan Khatri.

>>>
>>>
>>
>


Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Andy Dang
We remodel Spark dependencies and ours together and chuck them under the
/jars path. There are other ways to do it but we want the classpath to be
strictly as close to development as possible.

---
Regards,
Andy

On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri 
wrote:

> Andy, thanks for the reply.
>
> If we download all the dependencies to a separate location and link them
> with the Spark job jar on the Spark cluster, is that the best way to
> execute a Spark job?
>
> Thanks.
>
> On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang  wrote:
>
>> I used to use an uber jar in Spark 1.x because of classpath issues (we
>> couldn't re-model our dependencies based on our code, and thus the cluster's
>> runtime dependencies could be very different from running Spark directly in
>> the IDE). We had to use the userClasspathFirst "hack" to work around this.
>>
>> With Spark 2, it's easier to replace dependencies (say, Guava) than
>> before. We moved away from deploying superjar and just pass the libraries
>> as part of Spark jars (still can't use Guava v19 or later because Spark
>> uses a deprecated method that's not available, but that's not a big issue
>> for us).
>>
>> ---
>> Regards,
>> Andy
>>
>> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Community,
>>>
>>> For Spark job creation I use sbt-assembly to build an uber ("super") jar
>>> and then submit it with spark-submit.
>>>
>>> Example,
>>>
>>> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
>>> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>>>
>>> But other folks have argued for an uber-less jar. Can you please explain
>>> the best-practice industry standard for this?
>>>
>>> Thanks,
>>>
>>> Chetan Khatri.
>>>
>>
>>
>


Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
Andy, thanks for the reply.

If we download all the dependencies to a separate location and link them with
the Spark job jar on the Spark cluster, is that the best way to execute a
Spark job?

Thanks.

On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang  wrote:

> I used to use an uber jar in Spark 1.x because of classpath issues (we
> couldn't re-model our dependencies based on our code, and thus the cluster's
> runtime dependencies could be very different from running Spark directly in
> the IDE). We had to use the userClasspathFirst "hack" to work around this.
>
> With Spark 2, it's easier to replace dependencies (say, Guava) than
> before. We moved away from deploying superjar and just pass the libraries
> as part of Spark jars (still can't use Guava v19 or later because Spark
> uses a deprecated method that's not available, but that's not a big issue
> for us).
>
> ---
> Regards,
> Andy
>
> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Spark Community,
>>
>> For Spark job creation I use sbt-assembly to build an uber ("super") jar
>> and then submit it with spark-submit.
>>
>> Example,
>>
>> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
>> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>>
>> But other folks have argued for an uber-less jar. Can you please explain
>> the best-practice industry standard for this?
>>
>> Thanks,
>>
>> Chetan Khatri.
>>
>
>


Re: Can't access the data in Kafka Spark Streaming globally

2016-12-23 Thread Cody Koeninger
This doesn't sound like a question about Kafka streaming; it sounds like
confusion about the scope of variables in Spark generally.
Is that right? If so, I'd suggest reading the documentation, starting
with a simple RDD (e.g. using sparkContext.parallelize), and
experimenting to confirm your understanding.
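
For example, a minimal Scala sketch of that idea applied to the question quoted
below, with placeholder table and field names: instead of printing values
inside foreachPartition (which runs on the executors), each micro-batch is
turned into a DataFrame and joined against the Hive lookup table.

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

case class Event(state: String, city: String, region: String)

object StreamHiveJoin {
  // `stream` is the (key, value) DStream returned by
  // KafkaUtils.createDirectStream, as in the quoted Java code.
  def process(ssc: StreamingContext, stream: DStream[(String, String)]): Unit = {
    val hiveContext = new HiveContext(ssc.sparkContext)
    import hiveContext.implicits._

    stream.foreachRDD { rdd =>
      // Parse each record into a typed row instead of printing it on the
      // executors; the resulting DataFrame behaves like any other table.
      val events = rdd.map { case (_, value) =>
        val fields = value.split(",")
        Event(fields(2), fields(3), fields(4))
      }.toDF()

      // Join the micro-batch against a Hive lookup table (placeholder name);
      // the join itself runs distributed like any other DataFrame operation.
      val lookup = hiveContext.table("lookup_table")
      events.join(lookup, Seq("state")).show()
    }
  }
}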

On Thu, Dec 22, 2016 at 11:46 PM, Sree Eedupuganti  wrote:
> I am trying to stream the data from Kafka to Spark.
>
> JavaPairInputDStream<String, String> directKafkaStream =
> KafkaUtils.createDirectStream(ssc,
> String.class,
> String.class,
> StringDecoder.class,
> StringDecoder.class,
> kafkaParams, topics);
>
> Here i am iterating over the JavaPairInputDStream to process the RDD's.
>
> directKafkaStream.foreachRDD(rdd ->{
> rdd.foreachPartition(items ->{
> while (items.hasNext()) {
> String[] State = items.next()._2.split("\\,");
>
> System.out.println(State[2]+","+State[3]+","+State[4]+"--");
> };
> });
> });
>
>
> Inside this I am able to access the String array, but when I try to access
> the String array data globally I can't access it. My requirement is to
> access these data globally, because I have another lookup table in Hive
> and I am trying to perform an operation combining the two. Any suggestions
> please. Thanks.
>
>
> --
> Best Regards,
> Sreeharsha Eedupuganti




Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-23 Thread Justin Miller
I'm curious about this as well. Seems like the vote passed. 

> On Dec 23, 2016, at 2:00 AM, Aseem Bansal  wrote:
> 
> 





Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Andy Dang
I used to use an uber jar in Spark 1.x because of classpath issues (we
couldn't re-model our dependencies based on our code, and thus the cluster's
runtime dependencies could be very different from running Spark directly in
the IDE). We had to use the userClasspathFirst "hack" to work around this.

With Spark 2, it's easier to replace dependencies (say, Guava) than before.
We moved away from deploying superjar and just pass the libraries as part
of Spark jars (still can't use Guava v19 or later because Spark uses a
deprecated method that's not available, but that's not a big issue for us).

---
Regards,
Andy

On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri 
wrote:

> Hello Spark Community,
>
> For Spark job creation I use sbt-assembly to build an uber ("super") jar
> and then submit it with spark-submit.
>
> Example,
>
> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>
> But other folks have argued for an uber-less jar. Can you please explain
> the best-practice industry standard for this?
>
> Thanks,
>
> Chetan Khatri.
>


ThreadPoolExecutor - slow spark job

2016-12-23 Thread geoHeil
Hi,

I built a Spark job which is very slow.
A ThreadPoolExecutor is executed for every second task of my custom Spark
pipeline step.

Additionally, I noticed that Spark is spending a lot of time in garbage
collection, and sometimes 0 tasks are launched but the driver is still
waiting.
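
A common first diagnostic step for GC-heavy executors, independent of the
specifics above, is to enable GC logging, for example through SparkConf (a
minimal sketch, not taken from the original post):

import org.apache.spark.{SparkConf, SparkContext}

// Print GC details in each executor's stdout/stderr so time spent in
// collections can be correlated with the slow tasks in the Spark UI.
val conf = new SparkConf()
  .setAppName("gc-diagnostics")
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps")

val sc = new SparkContext(conf)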

I put it up here:
http://stackoverflow.com/questions/41298550/spark-threadpoolexecutor-very-often-called-in-tasks
as well, with a minimal example at
https://github.com/geoHeil/sparkContrastCoding

Looking forward to any input on speeding up this Spark job.

cheers,
Georg







Dependency Injection and Microservice development with Spark

2016-12-23 Thread Chetan Khatri
Hello Community,

My current approach for Spark job development is Scala + SBT with an uber jar,
and a yml properties file to pass configuration parameters. But if I would
like to use dependency injection and microservice-style development in Scala,
similar to what Spring Boot offers, what would be the standard approach?
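
One lightweight, framework-free option is plain constructor injection in
Scala, with Typesafe Config standing in for the yml file. A rough sketch with
made-up names:

import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// The job's collaborators are small traits...
trait EventSource { def load(sc: SparkContext): RDD[String] }
trait EventSink   { def save(rdd: RDD[String]): Unit }

class HdfsEventSource(path: String) extends EventSource {
  override def load(sc: SparkContext): RDD[String] = sc.textFile(path)
}

class ConsoleSink extends EventSink {
  override def save(rdd: RDD[String]): Unit = rdd.take(10).foreach(println)
}

// ...and the job receives them through its constructor, so tests can swap in
// fakes without any container.
class IngestJob(source: EventSource, sink: EventSink) {
  def run(sc: SparkContext): Unit = sink.save(source.load(sc).filter(_.nonEmpty))
}

object Main {
  def main(args: Array[String]): Unit = {
    val config: Config = ConfigFactory.load()  // application.conf instead of yml
    val sc = new SparkContext(new SparkConf().setAppName("ingest"))
    val job = new IngestJob(
      new HdfsEventSource(config.getString("ingest.input.path")),
      new ConsoleSink)
    job.run(sc)
    sc.stop()
  }
}

Compile-time DI libraries such as MacWire, or runtime containers such as
Guice, can layer the same constructor wiring behind less boilerplate.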

Thanks

Chetan


Re: parsing embedded json in spark

2016-12-23 Thread Tal Grynbaum
Hi Shaw,

Thanks, that works!
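
For reference, a minimal Scala sketch of the suggestion quoted below, with
made-up column and field names; get_json_object keeps the extraction inside
Spark SQL, so it runs distributed rather than through a hand-written UDF:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, get_json_object}

// Assuming a DataFrame with a string column "payload" holding JSON such as
// {"user": {"id": 42}}, pull a nested field out into its own column.
def extractUserId(df: DataFrame): DataFrame =
  df.withColumn("userId", get_json_object(col("payload"), "$.user.id"))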



On Thu, Dec 22, 2016 at 6:45 PM, Shaw Liu  wrote:

> Hi, I guess you can use the 'get_json_object' function.
>
>
>
>
>
> On Thu, Dec 22, 2016 at 9:52 PM +0800, "Irving Duran" <
> irving.du...@gmail.com> wrote:
>
> Is it an option to parse that field prior to creating the dataframe? If
>> so, that's what I would do.
>>
>> In terms of only your master node doing the work, you have to share more
>> about your setup: are you using Spark standalone, YARN, or Mesos?
>>
>>
>> Thank You,
>>
>> Irving Duran
>>
>> On Thu, Dec 22, 2016 at 1:42 AM, Tal Grynbaum 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a dataframe that contains an embedded JSON string in one of its
>>> fields.
>>> I tried to write a UDF that parses it using lift-json,
>>> but it seems to take a very long time to process, and it seems that only
>>> the master node is working.
>>>
>>> Has anyone dealt with such a scenario before and can give me some hints?
>>>
>>> Thanks
>>> Tal
>>>
>>
>>


-- 
*Tal Grynbaum* / *CTO & co-founder*

m# +972-54-7875797

mobile retention done right


Re: Ingesting data in elasticsearch from hdfs using spark , cluster setup and usage

2016-12-23 Thread Anastasios Zouzias
Hi Rohit,

Since your instances are only 16 GB dual-core, I would suggest using dedicated
nodes for Elasticsearch, with 8 GB for the Elasticsearch heap. This way you
won't have any interference between the Spark executors and Elasticsearch.

Also, if possible, you could try to use SSD disks on these 3 machines for
storing the Elasticsearch indices; this will boost your Elasticsearch
cluster's performance.
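
As a rough illustration of the es-spark write path mentioned in the quoted
question below, a minimal Scala sketch with a couple of the es-hadoop batching
settings that usually matter for ingestion throughput; hosts, index name, and
values are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs

val conf = new SparkConf()
  .setAppName("hdfs-to-es")
  .set("es.nodes", "es-node-1,es-node-2,es-node-3")
  .set("es.batch.size.entries", "5000")    // documents per bulk request
  .set("es.batch.size.bytes", "5mb")       // or cap each bulk request by size
  .set("es.batch.write.refresh", "false")  // skip the per-bulk index refresh

val sc = new SparkContext(conf)

// One minimal document per input line; a real job would parse proper fields.
val docs = sc.textFile("hdfs:///data/events/*.json").map(line => Map("raw" -> line))

docs.saveToEs("events/doc")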

Best,
Anastasios

On Thu, Dec 22, 2016 at 6:35 PM, Rohit Verma 
wrote:

> I am setting up a Spark cluster. I have HDFS data nodes and Spark master
> nodes on the same instances. To add Elasticsearch to this cluster, should I
> spawn ES on different machines or on the same machines? I have only 12
> machines:
> 1 - master (Spark and HDFS)
> 8 - Spark workers and HDFS data nodes
> I can use 3 nodes dedicated to ES, or use 11 nodes running all three.
>
> All instances are the same: 16 GB, dual core (unfortunately).
>
> Also, I am trying the es-hadoop / es-spark project, but ingestion felt very
> slow when I used 3 dedicated nodes, roughly 0.6 million records/minute.
> If anyone has experience using that project, can you please share your
> thoughts about tuning?
>
> Regards
> Rohit
>
>


-- 
-- Anastasios Zouzias



Is there any scheduled release date for Spark 2.1.0?

2016-12-23 Thread Aseem Bansal