Re: SparkAppHandle.Listener.infoChanged behaviour

2017-06-05 Thread Mohammad Tariq
Hi Marcelo,

Thank you so much for the response. Appreciate it!


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=product_medium=email_sig_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Mon, Jun 5, 2017 at 7:24 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Sat, Jun 3, 2017 at 7:16 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> > I am having a bit of difficulty in understanding the exact behaviour of
> > SparkAppHandle.Listener.infoChanged(SparkAppHandle handle) method. The
> > documentation says :
> >
> > Callback for changes in any information that is not the handle's state.
> >
> > What exactly is meant by any information here? Apart from state other
> pieces
> > of information I can see is ID
>
> So, you answered your own question.
>
> If there's ever any new kind of information, it would use the same event.
>
> --
> Marcelo
>


SparkAppHandle.Listener.infoChanged behaviour

2017-06-03 Thread Mohammad Tariq
Dear fellow Spark users,

I am having a bit of difficulty in understanding the exact behaviour
of *SparkAppHandle.Listener.infoChanged(SparkAppHandle
handle)* method. The documentation says :

*Callback for changes in any information that is not the handle's state.*

What exactly is meant by *any information* here? Apart from state other
pieces of information I can see is *ID* and *isFinal*. I can't see anything
else. What's that thing which actually triggers *infoChanged* method? I can
clearly see *stateChanged* getting called at each state change during the
course of the execution of my program. However, *infoChanged* gets called
just once, when the state transitions from *SUBMITTED* to *RUNNING*.

Does infoChanged gets called only once, when state transitions
from SUBMITTED to RUNNING? Or is it just in my case?

Thank you so much for your valuable time. Appreciate it!

Best,
Tariq


Application not found in RM

2017-04-17 Thread Mohammad Tariq
Dear fellow Spark users,

*Use case :* I have written a small java client which launches multiple
Spark jobs through *SparkLauncher* and captures jobs' metrics during the
course of the execution.

*Issue :* Sometimes the client fails saying -
*Caused by:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException):
Application with id 'application_APP_ID' doesn't exist in RM.*

I am using *YarnClient.getApplicationReport(ApplicationID ID)* to get the
desired metrics. I forced the threads to sleep for sometime so that
applications actually gets started before I query for these metrics. Most
of the times it works. However, I feel this is not the correct approach.

What could be the ideal way to handle such situation?

Thank you so much for your valuable time!





[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]




[image: --]

Tariq, Mohammad
[image: https://]about.me/mti



Intermittent issue while running Spark job through SparkLauncher

2017-03-25 Thread Mohammad Tariq
Dear fellow Spark users,

I have a multithreaded Java program which launches multiple Spark jobs in
parallel through *SparkLauncher* API. It also monitors these Spark jobs and
keeps on updating information like job start/end time, current state,
tracking url etc in an audit table. To get these information I am making
use of *YarnClient.getApplicationReport(ApplicationId)* API. The
ApplicationId used here is retrieved through
*SparkAppHandle.getAppId()* However,
I have noticed that sometimes my program fails saying :

*org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException:
Application with id 'application_1490281320520_107048' doesn't exist in RM.*

This is an intermittent problem and most of the times it runs successfully.
But when it fails, it fails twice/thrice successively. After re-running the
program repeatedly it returns to normal.

Has anyone else faced this issue? Any help is greatly appreciated. Thank
you so much for your valuable time!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]




[image: --]

Tariq, Mohammad
[image: https://]about.me/mti



Re: using spark to load a data warehouse in real time

2017-02-28 Thread Mohammad Tariq
You could try this as a blueprint :

Read the data in through Spark Streaming. Iterate over it and convert each
RDD into a DataFrame. Use these DataFrames to perform whatever processing
is required and then save that DataFrame into your target relational
warehouse.

HTH


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=product_medium=email_sig_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Wed, Mar 1, 2017 at 12:27 AM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hi Adaryl,
>
> You could definitely load data into a warehouse through Spark's JDBC
> support through DataFrames. Could you please explain your use case a bit
> more? That'll help us in answering your query better.
>
>
>
>
> [image: --]
>
> Tariq, Mohammad
> [image: https://]about.me/mti
>
> <https://about.me/mti?promo=email_sig_source=product_medium=email_sig_campaign=chrome_ext>
>
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>
> On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <
> adaryl.wakefi...@hotmail.com> wrote:
>
>> I haven’t heard of Kafka connect. I’ll have to look into it. Kafka would,
>> of course have to be in any architecture but it looks like they are
>> suggesting that Kafka is all you need.
>>
>>
>>
>> My primary concern is the complexity of loading warehouses. I have a web
>> development background so I have somewhat of an idea on how to insert data
>> into a database from an application. I’ve since moved on to straight
>> database programming and don’t work with anything that reads from an app
>> anymore.
>>
>>
>>
>> Loading a warehouse requires a lot of cleaning of data and running and
>> grabbing keys to maintain referential integrity. Usually that’s done in a
>> batch process. Now I have to do it record by record (or a few records). I
>> have some ideas but I’m not quite there yet.
>>
>>
>>
>> I thought SparkSQL would be the way to get this done but so far, all the
>> examples I’ve seen are just SELECT statements, no INSERTS or MERGE
>> statements.
>>
>>
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics, LLC
>> 913.938.6685
>>
>> www.massstreet.net
>>
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>
>>
>> *From:* Femi Anthony [mailto:femib...@gmail.com]
>> *Sent:* Tuesday, February 28, 2017 4:13 AM
>> *To:* Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: using spark to load a data warehouse in real time
>>
>>
>>
>> Have you checked to see if there are any drivers to enable you to write
>> to Greenplum directly from Spark ?
>>
>>
>>
>> You can also take a look at this link:
>>
>>
>>
>> https://groups.google.com/a/greenplum.org/forum/m/#!topic/gp
>> db-users/lnm0Z7WBW6Q
>>
>>
>>
>> Apparently GPDB is based on Postgres so maybe that approach may work.
>>
>> Another approach maybe for Spark Streaming to write to Kafka, and then
>> have another process read from Kafka and write to Greenplum.
>>
>>
>>
>> Kafka Connect may be useful in this case -
>>
>>
>>
>> https://www.confluent.io/blog/announcing-kafka-connect-build
>> ing-large-scale-low-latency-data-pipelines/
>>
>>
>>
>> Femi Anthony
>>
>>
>>
>>
>>
>>
>> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <
>> adaryl.wakefi...@hotmail.com> wrote:
>>
>> Is anybody using Spark streaming/SQL to load a relational data warehouse
>> in real time? There isn’t a lot of information on this use case out there.
>> When I google real time data warehouse load, nothing I find is up to date.
>> It’s all turn of the century stuff and doesn’t take into account
>> advancements in database technology. Additionally, whenever I try to learn
>> spark, it’s always the same thing. Play with twitter data never structured
>> data. All the CEP uses cases are about data science.
>>
>>
>>
>> I’d like to use Spark to load Greenplumb in real time. Intuitively, this
>> should be possible. I was thinking Spark Streaming with Spark SQL along
>> with a ORM should do it. Am I off base with this? Is the reason why there
>> are no examples is because there is a better way to do what I want?
>>
>>
>>
>> Adaryl "Bob" Wakefield, MBA
>> Principal
>> Mass Street Analytics, LLC
>> 913.938.6685
>>
>> www.massstreet.net
>>
>> www.linkedin.com/in/bobwakefieldmba
>> Twitter: @BobLovesData
>>
>>
>>
>>
>


Re: using spark to load a data warehouse in real time

2017-02-28 Thread Mohammad Tariq
Hi Adaryl,

You could definitely load data into a warehouse through Spark's JDBC
support through DataFrames. Could you please explain your use case a bit
more? That'll help us in answering your query better.




[image: --]

Tariq, Mohammad
[image: https://]about.me/mti





[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield <
adaryl.wakefi...@hotmail.com> wrote:

> I haven’t heard of Kafka connect. I’ll have to look into it. Kafka would,
> of course have to be in any architecture but it looks like they are
> suggesting that Kafka is all you need.
>
>
>
> My primary concern is the complexity of loading warehouses. I have a web
> development background so I have somewhat of an idea on how to insert data
> into a database from an application. I’ve since moved on to straight
> database programming and don’t work with anything that reads from an app
> anymore.
>
>
>
> Loading a warehouse requires a lot of cleaning of data and running and
> grabbing keys to maintain referential integrity. Usually that’s done in a
> batch process. Now I have to do it record by record (or a few records). I
> have some ideas but I’m not quite there yet.
>
>
>
> I thought SparkSQL would be the way to get this done but so far, all the
> examples I’ve seen are just SELECT statements, no INSERTS or MERGE
> statements.
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
>
> www.massstreet.net
>
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>
>
> *From:* Femi Anthony [mailto:femib...@gmail.com]
> *Sent:* Tuesday, February 28, 2017 4:13 AM
> *To:* Adaryl Wakefield 
> *Cc:* user@spark.apache.org
> *Subject:* Re: using spark to load a data warehouse in real time
>
>
>
> Have you checked to see if there are any drivers to enable you to write to
> Greenplum directly from Spark ?
>
>
>
> You can also take a look at this link:
>
>
>
> https://groups.google.com/a/greenplum.org/forum/m/#!topic/
> gpdb-users/lnm0Z7WBW6Q
>
>
>
> Apparently GPDB is based on Postgres so maybe that approach may work.
>
> Another approach maybe for Spark Streaming to write to Kafka, and then
> have another process read from Kafka and write to Greenplum.
>
>
>
> Kafka Connect may be useful in this case -
>
>
>
> https://www.confluent.io/blog/announcing-kafka-connect-
> building-large-scale-low-latency-data-pipelines/
>
>
>
> Femi Anthony
>
>
>
>
>
>
> On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <
> adaryl.wakefi...@hotmail.com> wrote:
>
> Is anybody using Spark streaming/SQL to load a relational data warehouse
> in real time? There isn’t a lot of information on this use case out there.
> When I google real time data warehouse load, nothing I find is up to date.
> It’s all turn of the century stuff and doesn’t take into account
> advancements in database technology. Additionally, whenever I try to learn
> spark, it’s always the same thing. Play with twitter data never structured
> data. All the CEP uses cases are about data science.
>
>
>
> I’d like to use Spark to load Greenplumb in real time. Intuitively, this
> should be possible. I was thinking Spark Streaming with Spark SQL along
> with a ORM should do it. Am I off base with this? Is the reason why there
> are no examples is because there is a better way to do what I want?
>
>
>
> Adaryl "Bob" Wakefield, MBA
> Principal
> Mass Street Analytics, LLC
> 913.938.6685
>
> www.massstreet.net
>
> www.linkedin.com/in/bobwakefieldmba
> Twitter: @BobLovesData
>
>
>
>


Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Mohammad Tariq
Hi Karim,

Are you looking for something specific? Some information about your usecase
would be really  helpful in order to answer your question.

On Wednesday, November 16, 2016, Karim, Md. Rezaul <
rezaul.ka...@insight-centre.org> wrote:

> Hi All,
>
> I am completely new with Kafka. I was wondering if somebody could provide
> me some guidelines on how to develop real-time streaming applications using
> Spark Streaming API with Kafka.
>
> I am aware the Spark Streaming  and Kafka integration [1]. However, a real
> life example should be better to start?
>
>
>
> 1. http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
>
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>


-- 


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
Sure, will look into the tests.

Thanks you so much for your time!


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Fri, Nov 11, 2016 at 4:35 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Sorry, it's kinda hard to give any more feedback from just the info you
> provided.
>
> I'd start with some working code like this from Spark's own unit tests:
> https://github.com/apache/spark/blob/a8ea4da8d04c1ed621a96668118f20
> 739145edd2/yarn/src/test/scala/org/apache/spark/deploy/
> yarn/YarnClusterSuite.scala#L164
>
>
> On Thu, Nov 10, 2016 at 3:00 PM, Mohammad Tariq <donta...@gmail.com>
> wrote:
>
>> All I want to do is submit a job, and keep on getting states as soon as
>> it changes, and come out once the job is over. I'm sorry to be a pest of
>> questions. Kind of having a bit of tough time making this work.
>>
>>
>> [image: --]
>>
>> Tariq, Mohammad
>> [image: https://]about.me/mti
>>
>> <https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>
>>
>>
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>> On Fri, Nov 11, 2016 at 4:27 AM, Mohammad Tariq <donta...@gmail.com>
>> wrote:
>>
>>> Yeah, that definitely makes sense. I was just trying to make it work
>>> somehow. The problem is that it's not at all calling the listeners, hence
>>> i'm unable to do anything. Just wanted to cross check it by looping inside.
>>> But I get the point. thank you for that!
>>>
>>> I'm on YARN(cluster mode).
>>>
>>>
>>> [image: --]
>>>
>>> Tariq, Mohammad
>>> [image: https://]about.me/mti
>>>
>>> <https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>
>>>
>>>
>>>
>>>
>>> [image: http://]
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>> <http://about.me/mti>
>>>
>>>
>>> On Fri, Nov 11, 2016 at 4:19 AM, Marcelo Vanzin <van...@cloudera.com>
>>> wrote:
>>>
>>>> On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq <donta...@gmail.com>
>>>> wrote:
>>>> >   @Override
>>>> >   public void stateChanged(SparkAppHandle handle) {
>>>> > System.out.println("Spark App Id [" + handle.getAppId() + "].
>>>> State [" + handle.getState() + "]");
>>>> > while(!handle.getState().isFinal()) {
>>>>
>>>> You shouldn't loop in an event handler. That's not really how
>>>> listeners work. Instead, use the event handler to update some local
>>>> state, or signal some thread that's waiting for the state change.
>>>>
>>>> Also be aware that handles currently only work in local and yarn
>>>> modes; the state updates haven't been hooked up to standalone mode
>>>> (maybe for client mode, but definitely not cluster) nor mesos.
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>
>>>
>>
>
>
> --
> Marcelo
>


Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
All I want to do is submit a job, and keep on getting states as soon as it
changes, and come out once the job is over. I'm sorry to be a pest of
questions. Kind of having a bit of tough time making this work.


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Fri, Nov 11, 2016 at 4:27 AM, Mohammad Tariq <donta...@gmail.com> wrote:

> Yeah, that definitely makes sense. I was just trying to make it work
> somehow. The problem is that it's not at all calling the listeners, hence
> i'm unable to do anything. Just wanted to cross check it by looping inside.
> But I get the point. thank you for that!
>
> I'm on YARN(cluster mode).
>
>
> [image: --]
>
> Tariq, Mohammad
> [image: https://]about.me/mti
>
> <https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>
>
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>
> On Fri, Nov 11, 2016 at 4:19 AM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
>
>> On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq <donta...@gmail.com>
>> wrote:
>> >   @Override
>> >   public void stateChanged(SparkAppHandle handle) {
>> > System.out.println("Spark App Id [" + handle.getAppId() + "]. State
>> [" + handle.getState() + "]");
>> > while(!handle.getState().isFinal()) {
>>
>> You shouldn't loop in an event handler. That's not really how
>> listeners work. Instead, use the event handler to update some local
>> state, or signal some thread that's waiting for the state change.
>>
>> Also be aware that handles currently only work in local and yarn
>> modes; the state updates haven't been hooked up to standalone mode
>> (maybe for client mode, but definitely not cluster) nor mesos.
>>
>> --
>> Marcelo
>>
>
>


Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
Yeah, that definitely makes sense. I was just trying to make it work
somehow. The problem is that it's not at all calling the listeners, hence
i'm unable to do anything. Just wanted to cross check it by looping inside.
But I get the point. thank you for that!

I'm on YARN(cluster mode).


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Fri, Nov 11, 2016 at 4:19 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Thu, Nov 10, 2016 at 2:43 PM, Mohammad Tariq <donta...@gmail.com>
> wrote:
> >   @Override
> >   public void stateChanged(SparkAppHandle handle) {
> > System.out.println("Spark App Id [" + handle.getAppId() + "]. State
> [" + handle.getState() + "]");
> > while(!handle.getState().isFinal()) {
>
> You shouldn't loop in an event handler. That's not really how
> listeners work. Instead, use the event handler to update some local
> state, or signal some thread that's waiting for the state change.
>
> Also be aware that handles currently only work in local and yarn
> modes; the state updates haven't been hooked up to standalone mode
> (maybe for client mode, but definitely not cluster) nor mesos.
>
> --
> Marcelo
>


Re: Correct SparkLauncher usage

2016-11-10 Thread Mohammad Tariq
Hi Marcelo,

After a few changes I got it working. However I could not understand one
thing. I need to call Thread.sleep() and then get the state explicitly in
order to make it work.

Also, no matter what I do my launcher program doesn't call stateChanged()
or infoChanged(). Here is my code :

public class RMLauncher implements SparkAppHandle.Listener {

  public static void main(String[] args) {

Map<String, String> map = new HashMap<>();
map.put("HADOOP_CONF_DIR", "/etc/hadoop/conf");
map.put("KRB5CCNAME", "/tmp/sparkjob");
map.put("SPARK_PRINT_LAUNCH_COMMAND", "1");
launchSparkJob(map);
  }

  public static void launchSparkJob(Map<String, String> map) {
SparkAppHandle handle = null;
try {
  handle = new SparkLauncher(map).startApplication();
} catch (IOException e) {
  e.printStackTrace();
}
  }

  @Override
  public void stateChanged(SparkAppHandle handle) {
System.out.println("Spark App Id [" + handle.getAppId() + "]. State ["
+ handle.getState() + "]");
while(!handle.getState().isFinal()) {
  System.out.println(">>>>>>>> state is not final yet");
  System.out.println(">>>>>>>> sleeping for a second");
  try {
Thread.sleep(1000L);
  } catch (InterruptedException e) {
  }
}
  }

  @Override
  public void infoChanged(SparkAppHandle handle) {
System.out.println("Spark App Id [" + handle.getAppId() + "] State
Changed. State [" + handle.getState() + "]");
  }
}

I have set all the required properties and I'm able to submit and run spark
jobs successfully. Any pointers would be really helpful.

Thanks again!


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Tue, Nov 8, 2016 at 5:16 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> Then you need to look at your logs to figure out why the child app is not
> working. "startApplication" will by default redirect the child's output to
> the parent's logs.
>
> On Mon, Nov 7, 2016 at 3:42 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Hi Marcelo,
>>
>> Thank you for the prompt response. I tried adding listeners as well,
>> didn't work either. Looks like it isn't starting the job at all.
>>
>>
>> [image: --]
>>
>> Tariq, Mohammad
>> [image: https://]about.me/mti
>>
>> <https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>
>>
>>
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>> On Tue, Nov 8, 2016 at 5:06 AM, Marcelo Vanzin <van...@cloudera.com>
>> wrote:
>>
>>> On Mon, Nov 7, 2016 at 3:29 PM, Mohammad Tariq <donta...@gmail.com>
>>> wrote:
>>> > I have been trying to use SparkLauncher.startApplication() to launch
>>> a Spark app from within java code, but unable to do so. However, same piece
>>> of code is working if I use SparkLauncher.launch().
>>> >
>>> > Here are the corresponding code snippets :
>>> >
>>> > SparkAppHandle handle = new SparkLauncher()
>>> >
>>> > .setSparkHome("/Users/miqbal1/DISTRIBUTED_WORLD/UNPACKED/sp
>>> ark-1.6.1-bin-hadoop2.6")
>>> >
>>> > .setJavaHome("/Library/Java/JavaVirtualMachines/jdk1.8.0_92
>>> .jdk/Contents/Home")
>>> >
>>> > .setAppResource("/Users/miqbal1/wc.jar").setMainClass("org.
>>> myorg.WC").setMaster("local")
>>> >
>>> > .setConf("spark.dynamicAllocation.enabled",
>>> "true").startApplication();System.out.println(handle.getAppId());
>>> >
>>> > System.out.println(handle.getState());
>>> >
>>> > This prints null and UNKNOWN as output.
>>>
>>> The information you're printing is not available immediately after you
>>> call "startApplication()". The Spark app is still starting, so it may
>>> take some time for the app ID and other info to be reported back. The
>>> "startApplication()" method allows you to provide listeners you can
>>> use to know when that information is available.
>>>
>>> --
>>> Marcelo
>>>
>>
>>
>
>
> --
> Marcelo
>


Re: Correct SparkLauncher usage

2016-11-07 Thread Mohammad Tariq
Hi Marcelo,

Thank you for the prompt response. I tried adding listeners as well, didn't
work either. Looks like it isn't starting the job at all.


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti
<https://about.me/mti?promo=email_sig_source=email_sig_medium=external_link_campaign=chrome_ext>




[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Tue, Nov 8, 2016 at 5:06 AM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Mon, Nov 7, 2016 at 3:29 PM, Mohammad Tariq <donta...@gmail.com> wrote:
> > I have been trying to use SparkLauncher.startApplication() to launch a
> Spark app from within java code, but unable to do so. However, same piece
> of code is working if I use SparkLauncher.launch().
> >
> > Here are the corresponding code snippets :
> >
> > SparkAppHandle handle = new SparkLauncher()
> >
> > .setSparkHome("/Users/miqbal1/DISTRIBUTED_WORLD/
> UNPACKED/spark-1.6.1-bin-hadoop2.6")
> >
> > .setJavaHome("/Library/Java/JavaVirtualMachines/jdk1.8.0_92
> .jdk/Contents/Home")
> >
> > .setAppResource("/Users/miqbal1/wc.jar").setMainClass("org.
> myorg.WC").setMaster("local")
> >
> > .setConf("spark.dynamicAllocation.enabled",
> "true").startApplication();System.out.println(handle.getAppId());
> >
> > System.out.println(handle.getState());
> >
> > This prints null and UNKNOWN as output.
>
> The information you're printing is not available immediately after you
> call "startApplication()". The Spark app is still starting, so it may
> take some time for the app ID and other info to be reported back. The
> "startApplication()" method allows you to provide listeners you can
> use to know when that information is available.
>
> --
> Marcelo
>


Correct SparkLauncher usage

2016-11-07 Thread Mohammad Tariq
Dear fellow Spark users,

I have been trying to use *SparkLauncher.startApplication()* to launch a
Spark app from within java code, but unable to do so. However, same piece
of code is working if I use *SparkLauncher.launch()*.

Here are the corresponding code snippets :

*SparkAppHandle handle = new SparkLauncher()*

*
.setSparkHome("/Users/miqbal1/DISTRIBUTED_WORLD/UNPACKED/spark-1.6.1-bin-hadoop2.6")*

*
.setJavaHome("/Library/Java/JavaVirtualMachines/jdk1.8.0_92.jdk/Contents/Home")*

*
.setAppResource("/Users/miqbal1/wc.jar").setMainClass("org.myorg.WC").setMaster("local")*

*.setConf("spark.dynamicAllocation.enabled",
"true").startApplication();System.out.println(handle.getAppId());*

*System.out.println(handle.getState());*

This prints *null* and *UNKNOWN *as output.

*Process spark = new SparkLauncher()*

*
.setSparkHome("/Users/miqbal1/DISTRIBUTED_WORLD/UNPACKED/spark-1.6.1-bin-hadoop2.6")*

*
.setJavaHome("/Library/Java/JavaVirtualMachines/jdk1.8.0_92.jdk/Contents/Home")*

*
.setAppResource("/Users/miqbal1/wc.jar").setMainClass("org.myorg.WC").setMaster("local").launch();*

*spark.waitFor();*

*InputStreamReaderRunnable inputStreamReaderRunnable = new
InputStreamReaderRunnable(spark.getInputStream(),*

*"input");*

*Thread inputThread = new Thread(inputStreamReaderRunnable,
"LogStreamReader input");*

*inputThread.start();*


*InputStreamReaderRunnable errorStreamReaderRunnable = new
InputStreamReaderRunnable(spark.getErrorStream(),*

*"error");*

*Thread errorThread = new Thread(errorStreamReaderRunnable,
"LogStreamReader error");*

*errorThread.start();*


*System.out.println("Waiting for finish...");*

*int exitCode = spark.waitFor();*

*System.out.println("Finished! Exit code:" + exitCode);*

While this works perfectly fine.

Any pointers would be really helpful.


Thank you!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]




[image: --]

Tariq, Mohammad
[image: https://]about.me/mti



Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-12 Thread Mohammad Tariq
Hi Divya,

Do you you have inbounds enabled on port 50070 of your NN machine. Also,
it's a good idea to have the public DNS in your /etc/hosts for proper name
resolution.


[image: --]

Tariq, Mohammad
[image: https://]about.me/mti





[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Tue, Sep 13, 2016 at 9:28 AM, Divya Gehlot 
wrote:

> Hi,
> I am on EMR 4.7 with Spark 1.6.1   and Hadoop 2.7.2
> When I am trying to view Any of the web UI of the cluster either hadoop or
> Spark ,I am getting below error
> "
> This site can’t be reached
>
> "
> Has anybody using EMR and able to view WebUI .
> Could you please share the steps.
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>


Re: Is spark-1.6.1-bin-2.6.0 compatible with hive-1.1.0-cdh5.7.1

2016-07-28 Thread Mohammad Tariq
Hi Mich,

Thank you so much for the prompt response!

I do have a copy of hive-site.xml in spark conf directory.

On Thursday, July 28, 2016, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> This line
>
> 2016-07-28 04:36:01,814] INFO Property hive.metastore.integral.jdo.pushdown
> unknown - will be ignored (DataNucleus.Persistence:77)
>
> telling me that you do don't seem to have the softlink to hive-site.xml
> in $SPARK_HOME/conf
>
> hive-site.xml -> /usr/lib/hive/conf/hive-site.xml
>
> I suggest you check that. That is the reason it cannot find you Hive
> metastore
>
>
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 12:45, Mohammad Tariq <donta...@gmail.com
> <javascript:_e(%7B%7D,'cvml','donta...@gmail.com');>> wrote:
>
>> Could anyone please help me with this? I have been using the same version
>> of Spark with CDH-5.4.5 successfully so far. However after a recent CDH
>> upgrade I'm not able to run the same Spark SQL module against
>> hive-1.1.0-cdh5.7.1.
>>
>> When I try to run my program Spark tries to connect to local derby Hive
>> metastore instead of the configured MySQL metastore. I have all the
>> required jars along with hive-site.xml in place though. There is no change
>> in the setup.
>>
>> This is the exception which I'm getting :
>>
>> [2016-07-28 04:36:01,207] INFO Initializing execution hive, version 1.2.1
>> (org.apache.spark.sql.hive.HiveContext:58)
>>
>> [2016-07-28 04:36:01,231] INFO Inspected Hadoop version: 2.6.0-cdh5.7.1
>> (org.apache.spark.sql.hive.client.ClientWrapper:58)
>>
>> [2016-07-28 04:36:01,232] INFO Loaded
>> org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
>> 2.6.0-cdh5.7.1 (org.apache.spark.sql.hive.client.ClientWrapper:58)
>>
>> [2016-07-28 04:36:01,520] INFO 0: Opening raw store with implemenation
>> class:org.apache.hadoop.hive.metastore.ObjectStore
>> (org.apache.hadoop.hive.metastore.HiveMetaStore:638)
>>
>> [2016-07-28 04:36:01,548] INFO ObjectStore, initialize called
>> (org.apache.hadoop.hive.metastore.ObjectStore:332)
>>
>> [2016-07-28 04:36:01,814] INFO Property
>> hive.metastore.integral.jdo.pushdown unknown - will be ignored
>> (DataNucleus.Persistence:77)
>>
>> [2016-07-28 04:36:01,815] INFO Property datanucleus.cache.level2 unknown
>> - will be ignored (DataNucleus.Persistence:77)
>>
>> [2016-07-28 04:36:02,417] WARN Retrying creating default database after
>> error: Unexpected exception caught.
>> (org.apache.hadoop.hive.metastore.HiveMetaStore:671)
>>
>> javax.jdo.JDOFatalInternalException: Unexpected exception caught.
>>
>> at
>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)
>>
>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>>
>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:410)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:439)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:334)
>>
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:290)
>>
>> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>>
>> at
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>>
>> at
>> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>>
>> at
>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:642)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:620)
>>
>> at
>> org.apache.hadoop.hive.metasto

Is spark-1.6.1-bin-2.6.0 compatible with hive-1.1.0-cdh5.7.1

2016-07-28 Thread Mohammad Tariq
Could anyone please help me with this? I have been using the same version
of Spark with CDH-5.4.5 successfully so far. However after a recent CDH
upgrade I'm not able to run the same Spark SQL module against
hive-1.1.0-cdh5.7.1.

When I try to run my program Spark tries to connect to local derby Hive
metastore instead of the configured MySQL metastore. I have all the
required jars along with hive-site.xml in place though. There is no change
in the setup.

This is the exception which I'm getting :

[2016-07-28 04:36:01,207] INFO Initializing execution hive, version 1.2.1
(org.apache.spark.sql.hive.HiveContext:58)

[2016-07-28 04:36:01,231] INFO Inspected Hadoop version: 2.6.0-cdh5.7.1
(org.apache.spark.sql.hive.client.ClientWrapper:58)

[2016-07-28 04:36:01,232] INFO Loaded
org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version
2.6.0-cdh5.7.1 (org.apache.spark.sql.hive.client.ClientWrapper:58)

[2016-07-28 04:36:01,520] INFO 0: Opening raw store with implemenation
class:org.apache.hadoop.hive.metastore.ObjectStore
(org.apache.hadoop.hive.metastore.HiveMetaStore:638)

[2016-07-28 04:36:01,548] INFO ObjectStore, initialize called
(org.apache.hadoop.hive.metastore.ObjectStore:332)

[2016-07-28 04:36:01,814] INFO Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
(DataNucleus.Persistence:77)

[2016-07-28 04:36:01,815] INFO Property datanucleus.cache.level2 unknown -
will be ignored (DataNucleus.Persistence:77)

[2016-07-28 04:36:02,417] WARN Retrying creating default database after
error: Unexpected exception caught.
(org.apache.hadoop.hive.metastore.HiveMetaStore:671)

javax.jdo.JDOFatalInternalException: Unexpected exception caught.

at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1193)

at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)

at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)

at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:410)

at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:439)

at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:334)

at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:290)

at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)

at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)

at
org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)

at
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:642)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:620)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:669)

at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:478)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.(RetryingHMSHandler.java:78)

at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)

at
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5903)

at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:198)

at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)

at java.lang.reflect.Constructor.newInstance(Constructor.java:423)

at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1491)

at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:67)

at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:82)

at
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2935)

at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2954)

at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:513)

at
org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)

at
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)

at
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)

at
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)

at
org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)

at
org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)

at org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:40)

at org.apache.spark.sql.SQLContext.(SQLContext.scala:330)

at 

Re: Recommended way to push data into HBase through Spark streaming

2016-06-16 Thread Mohammad Tariq
Forgot to add, I'm on HBase 1.0.0-cdh5.4.5, so can't use HBaseContext. And
spark version is 1.6.1



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Thu, Jun 16, 2016 at 10:12 PM, Mohammad Tariq <donta...@gmail.com> wrote:

> Hi group,
>
> I have a streaming job which reads data from Kafka, performs some
> computation and pushes the result into HBase. Actually the results are
> pushed into 3 different HBase tables. So I was wondering what could be the
> best way to achieve this.
>
> Since each executor will open its own HBase connection and write data to a
> regionserver independent of rest of the executors I feel this is a bit of
> overkill. How about collecting the results of each micro batch and putting
> them in one shot at the end of that batch?
>
> If so what should be the way to go about this?
>
> Many thanks!
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>


Recommended way to push data into HBase through Spark streaming

2016-06-16 Thread Mohammad Tariq
Hi group,

I have a streaming job which reads data from Kafka, performs some
computation and pushes the result into HBase. Actually the results are
pushed into 3 different HBase tables. So I was wondering what could be the
best way to achieve this.

Since each executor will open its own HBase connection and write data to a
regionserver independent of rest of the executors I feel this is a bit of
overkill. How about collecting the results of each micro batch and putting
them in one shot at the end of that batch?

If so what should be the way to go about this?

Many thanks!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



Re: Running Spark in Standalone or local modes

2016-06-11 Thread Mohammad Tariq
Hi Ashok,

In local mode all the processes run inside a single jvm, whereas in
standalone mode we have separate master and worker processes running in
their own jvms.

To quickly test your code from within your IDE you could probable use the
local mode. However, to get a real feel of how Spark operates I would
suggest you to have a standalone setup as well. It's just the matter
of launching a standalone cluster either manually(by starting a master and
workers by hand), or by using the launch scripts provided with Spark
package.

You can find more on this *here*
.

HTH



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Sat, Jun 11, 2016 at 11:38 PM, Ashok Kumar 
wrote:

> Hi,
>
> What is the difference between running Spark in Local mode or standalone
> mode?
>
> Are they the same. If they are not which is best suited for non prod work.
>
> I am also aware that one can run Spark in Yarn mode as well.
>
> Thanks
>


DataFrame.foreach(scala.Function1) example

2016-06-10 Thread Mohammad Tariq
Dear fellow spark users,

Could someone please point me to any example showcasing the usage of
*DataFrame.oreach(scala.Function1)* in *Java*?

*Problem statement :* I am reading data from a Kafka topic, and for each
RDD in the DStream I am creating a DataFrame in order to perform some
operations. After this I have to call *DataFrame.javaRDD* to convert the
resulting DF back into a *JavaRDD* so that I can perform further
computation on each record in this RDD through *JavaRDD.foreach*.

However, I wish to remove this additional hop of creating JavaRDD from the
resultant DF. I would like to use *DataFrame.oreach(scala.Function1)*
instead, and perform the computation directly on each *Row* of this DF
rather than having to first convert the DF into JavaRDD and then performing
any computations.

For interfaces like *Function* and *VoidFunction*(provided by
org.apache.spark.api.java.function API) I could implement named classes and
use them in my code. But having some hard time in figuring out how to use
*scala.Function1*

Thank you so much for your valuable time. Really appreciate it!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



Re: Spark streaming readind avro from kafka

2016-06-01 Thread Mohammad Tariq
Hi Neeraj,

You might find Kafka-Direct
 useful.
BTW, are you using something like Confluent for you Kafka setup. If that's
the case you might leverage Schema registry to get hold of the associated
schema without additional effort.

And for the DataFrame part you could do something like this :

JavaPairDStream messages =
KafkaUtils.createDirectStream(jssc, Object.class, Object.class,
KafkaAvroDecoder.class, KafkaAvroDecoder.class, kafkaParams,
topicsSet);
JavaDStream records = messages.map(new Function, String>() {
  @Override
  public String call(Tuple2 tuple2) {
return tuple2._2().toString();
  }
});
records.foreachRDD(new Function() {
  @Override
  public Void call(JavaRDD rdd) throws Exception {
if (!rdd.isEmpty()) {
  SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
  DataFrame record = sqlContext.read().json(rdd);
  record.registerTempTable("dataAsTable");
  DataFrame dedupeFields = sqlContext.sql("SELECT name,
metadata.RBA, metadata.SHARD, "
+ "metadata.SCHEMA, metadata.`TABLE`, metadata.EVENTTYPE,
metadata.SOURCE_TS FROM dataAsTable");
}
return null;
  }
});
jssc.start();
jssc.awaitTermination();
  }



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Thu, Jun 2, 2016 at 1:59 AM, Igor Berman  wrote:

> Avro file contains metadata with schema(writer schema)
> in Kafka there is no such thing, you should put message that will contain
> some reference to known schema(put whole schema will have big overhead)
> some people use schema registry solution
>
> On 1 June 2016 at 21:02, justneeraj  wrote:
>
>> +1
>>
>> I am trying to read avro from kafka and I don't want to limit to a small
>> set
>> of schema. So I want to dynamically load the schema from avro file (as
>> avro
>> contains schema as well). And then from this I want to create a dataframe
>> and run some queries on that.
>>
>> Any help would be really thankful.
>>
>> Thanks,
>> Neeraj
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-readind-avro-from-kafka-tp22425p27067.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Recommended way to close resources in a Spark streaming application

2016-05-31 Thread Mohammad Tariq
Dear fellow Spark users,

I have a streaming app which is reading data from Kafka, doing some
computations and storing the results into HBase. Since I am new to Spark
streaming I feel that there could still be scope of making my app better.

To begin with, I was wondering what's the best way to free up resources in
case of app shutdown(because of some exception, or some other cause). While
looking for some help online I bumped into the Spark doc which talks about
*spark.streaming.stopGracefullyOnShutdown*. This says I*f true, Spark shuts
down the StreamingContext gracefully on JVM shutdown rather than
immediately*.

Or, does it make more sense to add a *ShutdownHook* explicitly in my app
and call JavaStreamingContext.stop()?

One potential benefit which I see with *ShutdownHook* is that I could close
any external resources before the JVM dies inside the *run()* method.

Thoughts/suggestions??

Also, I am banking on the fact that Kafka direct takes care of exact once
data delivery. It'll start consuming data after the point where the app had
crashed. Is there any way I can restart my streaming app automatically in
case of a failure?

I'm really sorry to be a pest of questions. I could not satisfy myself with
the answers I found online. Thank you so much for your valuable time.
Really appreciate it!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
I think this could be the reason :

DataFrame sorts the column of each record lexicographically if we do a *select
**. So, if we wish to maintain a specific column ordering while processing
we should use do *select col1, col2...* instead of select *.

However, this is just what I feel. Let's wait for comments from the gurus.



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Thu, Mar 3, 2016 at 5:35 AM, Mohammad Tariq <donta...@gmail.com> wrote:

> Cool. Here is it how it goes...
>
> I am reading Avro objects from a Kafka topic as a DStream, converting it
> into a DataFrame so that I can filter out records based on some conditions
> and finally do some aggregations on these filtered records. During the
> process I also need to tag each record based on the value of a particular
> column, and for this I am iterating over Array[Row] returned by
> DataFrame.collect().
>
> I am good as far as these things are concerned. The only thing which I am
> not getting is the reason behind changed column ordering within each Row.
> Say my actual record is [Tariq, IN, APAC]. When I
> do println(row.mkString("~")) it shows [IN~APAC~Tariq].
>
> I hope I was able to explain my use case to you!
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>
> On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasain...@gmail.com>
> wrote:
>
>> Hi Tariq,
>>
>> Can you tell in brief what kind of operation you have to do? I can try
>> helping you out with that.
>> In general, if you are trying to use any group operations you can use
>> window operations.
>>
>> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <donta...@gmail.com>
>> wrote:
>>
>>> Hi Sainath,
>>>
>>> Thank you for the prompt response!
>>>
>>> Could you please elaborate your answer a bit? I'm sorry I didn't quite
>>> get this. What kind of operation I can perform using SQLContext? It just
>>> helps us during things like DF creation, schema application etc, IMHO.
>>>
>>>
>>>
>>> [image: http://]
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>> <http://about.me/mti>
>>>
>>>
>>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasain...@gmail.com>
>>> wrote:
>>>
>>>> Instead of collecting the data frame, you can try using a sqlContext on
>>>> the data frame. But it depends on what kind of operations are you trying to
>>>> perform.
>>>>
>>>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <donta...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> *Scenario :*
>>>>> I am creating a DStream by reading an Avro object from a Kafka topic
>>>>> and then converting it into a DataFrame to perform some operations on the
>>>>> data. I call DataFrame.collect() and perform the intended operation on 
>>>>> each
>>>>> Row of Array[Row] returned by DataFrame.collect().
>>>>>
>>>>> *Problem : *
>>>>> Calling DataFrame.collect() changes the schema of the underlying
>>>>> record, thus making it impossible to get the columns by index(as the order
>>>>> gets changed).
>>>>>
>>>>> *Query :*
>>>>> Is it the way DataFrame.collect() behaves or am I doing something
>>>>> wrong here? In former case is there any way I can maintain the schema 
>>>>> while
>>>>> getting each Row?
>>>>>
>>>>> Any pointers/suggestions would be really helpful. Many thanks!
>>>>>
>>>>>
>>>>> [image: http://]
>>>>>
>>>>> Tariq, Mohammad
>>>>> about.me/mti
>>>>> [image: http://]
>>>>> <http://about.me/mti>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Cool. Here is it how it goes...

I am reading Avro objects from a Kafka topic as a DStream, converting it
into a DataFrame so that I can filter out records based on some conditions
and finally do some aggregations on these filtered records. During the
process I also need to tag each record based on the value of a particular
column, and for this I am iterating over Array[Row] returned by
DataFrame.collect().

I am good as far as these things are concerned. The only thing which I am
not getting is the reason behind changed column ordering within each Row.
Say my actual record is [Tariq, IN, APAC]. When I
do println(row.mkString("~")) it shows [IN~APAC~Tariq].

I hope I was able to explain my use case to you!



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasain...@gmail.com>
wrote:

> Hi Tariq,
>
> Can you tell in brief what kind of operation you have to do? I can try
> helping you out with that.
> In general, if you are trying to use any group operations you can use
> window operations.
>
> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Hi Sainath,
>>
>> Thank you for the prompt response!
>>
>> Could you please elaborate your answer a bit? I'm sorry I didn't quite
>> get this. What kind of operation I can perform using SQLContext? It just
>> helps us during things like DF creation, schema application etc, IMHO.
>>
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasain...@gmail.com>
>> wrote:
>>
>>> Instead of collecting the data frame, you can try using a sqlContext on
>>> the data frame. But it depends on what kind of operations are you trying to
>>> perform.
>>>
>>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <donta...@gmail.com>
>>> wrote:
>>>
>>>> Hi list,
>>>>
>>>> *Scenario :*
>>>> I am creating a DStream by reading an Avro object from a Kafka topic
>>>> and then converting it into a DataFrame to perform some operations on the
>>>> data. I call DataFrame.collect() and perform the intended operation on each
>>>> Row of Array[Row] returned by DataFrame.collect().
>>>>
>>>> *Problem : *
>>>> Calling DataFrame.collect() changes the schema of the underlying
>>>> record, thus making it impossible to get the columns by index(as the order
>>>> gets changed).
>>>>
>>>> *Query :*
>>>> Is it the way DataFrame.collect() behaves or am I doing something wrong
>>>> here? In former case is there any way I can maintain the schema while
>>>> getting each Row?
>>>>
>>>> Any pointers/suggestions would be really helpful. Many thanks!
>>>>
>>>>
>>>> [image: http://]
>>>>
>>>> Tariq, Mohammad
>>>> about.me/mti
>>>> [image: http://]
>>>> <http://about.me/mti>
>>>>
>>>>
>>>
>>>
>>
>


Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi Sainath,

Thank you for the prompt response!

Could you please elaborate your answer a bit? I'm sorry I didn't quite get
this. What kind of operation I can perform using SQLContext? It just helps
us during things like DF creation, schema application etc, IMHO.



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasain...@gmail.com>
wrote:

> Instead of collecting the data frame, you can try using a sqlContext on
> the data frame. But it depends on what kind of operations are you trying to
> perform.
>
> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Hi list,
>>
>> *Scenario :*
>> I am creating a DStream by reading an Avro object from a Kafka topic and
>> then converting it into a DataFrame to perform some operations on the data.
>> I call DataFrame.collect() and perform the intended operation on each Row
>> of Array[Row] returned by DataFrame.collect().
>>
>> *Problem : *
>> Calling DataFrame.collect() changes the schema of the underlying record,
>> thus making it impossible to get the columns by index(as the order gets
>> changed).
>>
>> *Query :*
>> Is it the way DataFrame.collect() behaves or am I doing something wrong
>> here? In former case is there any way I can maintain the schema while
>> getting each Row?
>>
>> Any pointers/suggestions would be really helpful. Many thanks!
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>
>


Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi list,

*Scenario :*
I am creating a DStream by reading an Avro object from a Kafka topic and
then converting it into a DataFrame to perform some operations on the data.
I call DataFrame.collect() and perform the intended operation on each Row
of Array[Row] returned by DataFrame.collect().

*Problem : *
Calling DataFrame.collect() changes the schema of the underlying record,
thus making it impossible to get the columns by index(as the order gets
changed).

*Query :*
Is it the way DataFrame.collect() behaves or am I doing something wrong
here? In former case is there any way I can maintain the schema while
getting each Row?

Any pointers/suggestions would be really helpful. Many thanks!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



Re: [Spark 1.5.2]: Iterate through Dataframe columns and put it in map

2016-03-02 Thread Mohammad Tariq
Hi Divya,

You could call *collect()* method provided by DataFram API. This will give
you an *Array[Rows]*. You could then iterate over this array and create
your map. Something like this :

val mapOfVals = scala.collection.mutable.Map[String,String]()
var rows = DataFrame.collect()
rows.foreach(r => mapOfVals.put(r.getString(0), r.getString(1)))
println("KEYS : " + mapOfVals.keys)
println("VALS : " + mapOfVals.values)

Hope this helps!





[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Wed, Mar 2, 2016 at 3:52 PM, Divya Gehlot 
wrote:

> Hi,
>
> I need to iterate through columns in dataframe based on certain condition
> and put it in map .
>
> Dataset
> Column1  Column2
> Car   Model1
> Bike   Model2
> Car Model2
> Bike   Model 2
>
> I want to iterate through above dataframe and put it in map where car is
> key and model1 and model 2 as values
>
>
> Thanks,
> Regards,
> Divya
>
>


Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-25 Thread Mohammad Tariq
Spark doesn't support subqueries in WHERE clause, IIRC. It supports
subqueries only in the FROM clause as of now. See this ticket
 for more on this.



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Fri, Feb 26, 2016 at 7:01 AM, ayan guha  wrote:

> Why is this not working for you? Are you trying on dataframe? What error
> are you getting?
>
> On Thu, Feb 25, 2016 at 10:23 PM, Ashok Kumar <
> ashok34...@yahoo.com.invalid> wrote:
>
>> Hi,
>>
>> What is the equivalent of this in Spark please
>>
>> select * from mytable where column1 in (select max(column1) from mytable)
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Mohammad Tariq
I got it working by using jsonRDD. This is what I had to do in order to
make it work :

  val messages = KafkaUtils.createDirectStream[Object, Object,
    KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet)
  val lines = messages.map(_._2.toString)
  lines.foreachRDD(jsonRDD => {
    val sqlContext = SQLContextSingleton.getInstance(jsonRDD.sparkContext)
    val data = sqlContext.read.json(jsonRDD)
    data.printSchema()
    data.show()
    data.select("COL_NAME").show()
    data.groupBy("COL_NAME").count().show()
  })

Not sure though if it's the best way to achieve this.
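
Another option worth noting, untested here: since the decoded value is
typically an Avro GenericRecord at runtime, individual fields can also be
pulled out with a plain map over the DStream, before any DataFrame conversion.
A sketch, with COL_NAME as a placeholder field name:

import org.apache.avro.generic.GenericRecord

val colValues = messages.map { case (_, value) =>
  value.asInstanceOf[GenericRecord].get("COL_NAME").toString
}
colValues.print()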



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Fri, Feb 26, 2016 at 5:21 AM, Shixiong(Ryan) Zhu <shixi...@databricks.com
> wrote:

> You can use `DStream.map` to transform objects to anything you want.
>
> On Thu, Feb 25, 2016 at 11:06 AM, Mohammad Tariq <donta...@gmail.com>
> wrote:
>
>> Hi group,
>>
>> I have just started working with confluent platform and spark streaming,
>> and was wondering if it is possible to access individual fields from an
>> Avro object read from a kafka topic through spark streaming. As per its
>> default behaviour *KafkaUtils.createDirectStream[Object, Object,
>> KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet)* returns
>> a *DStream[Object, Object]* which doesn't have any schema associated with
>> *Object* (or I am unable to figure it out). This makes it impossible to
>> perform some operations on this DStream, for example, converting it to a
>> Spark DataFrame.
>>
>> Since *KafkaAvroDecoder *doesn't allow us to have any other Class but
>> *Object *I think I am going in the wrong direction. Any
>> pointers/suggestions would be really helpful.
>>
>> *Versions used :*
>> confluent-1.0.1
>> spark-1.6.0-bin-hadoop2.4
>> Scala code runner version - 2.11.6
>>
>> And this is the small piece of code I am using :
>>
>> package org.myorg.scalaexamples
>>
>> import org.apache.spark.rdd.RDD
>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming._
>> import org.apache.spark.SparkContext
>> import org.apache.avro.mapred.AvroKey
>> import org.apache.spark.sql.SQLContext
>> //import org.apache.avro.mapred.AvroValue
>> import org.apache.spark.streaming.kafka._
>> import org.apache.spark.storage.StorageLevel
>> import org.apache.avro.generic.GenericRecord
>> import org.apache.spark.streaming.dstream.DStream
>> import io.confluent.kafka.serializers.KafkaAvroDecoder
>> //import org.apache.hadoop.io.serializer.avro.AvroRecord
>> //import org.apache.spark.streaming.dstream.ForEachDStream
>> import org.apache.spark.sql.SQLContext
>> import org.apache.kafka.common.serialization.Deserializer
>>
>> object DirectKafkaWordCount {
>>   def main(args: Array[String]) {
>> if (args.length < 2) {
>>   System.err.println(s"""
>> |Usage: DirectKafkaWordCount <brokers> <topics>
>> |  <brokers> is a list of one or more Kafka brokers
>> |  <topics> is a list of one or more kafka topics to consume from
>> |
>> """.stripMargin)
>>   System.exit(1)
>> }
>> val Array(brokers, topics) = args
>> val sparkConf = new
>> SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[1]")
>>
>> sparkConf.registerKryoClasses(Array(classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]]))
>> val ssc = new StreamingContext(sparkConf, Seconds(5))
>> val topicsSet = topics.split(",").toSet
>> val kafkaParams = Map[String, String]("metadata.broker.list" ->
>> brokers, "group.id" -> "consumer",
>>   "zookeeper.connect" -> "localhost:2181",
>>   "schema.registry.url" -> "http://localhost:8081")
>> val messages = KafkaUtils.createDirectStream[Object, Object,
>> KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet)
>> messages.print()
>> ssc.start()
>> ssc.awaitTermination()
>>   }
>> }
>>
>> Thank you so much for your valuable time!
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>
>


Re: Spark SQL support for sub-queries

2016-02-25 Thread Mohammad Tariq
AFAIK, this isn't supported yet. A ticket is in progress though.
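
In the meantime, the usual workaround is to run the inner query separately and
splice its result into the outer one. A sketch, reusing the HiveContext value
and the tmp temp table from the quoted mail below:

val maxId = HiveContext.sql("select max(id) from tmp").first().get(0)
HiveContext.sql(s"select * from tmp where id = $maxId").show()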



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Fri, Feb 26, 2016 at 4:16 AM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

>
>
> Hi,
>
>
>
> I guess the following confirms that Spark does not support sub-queries
>
>
>
> val d = HiveContext.table("test.dummy")
>
> d.registerTempTable("tmp")
>
> HiveContext.sql("select * from tmp where id IN (select max(id) from tmp)")
>
> It crashes
>
> The SQL works OK in Hive itself on the underlying table!
>
> select * from dummy where id IN (select max(id) from dummy);
>
>
>
> Thanks
>
>
> --
>
> Dr Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>
>
>


Access fields by name/index from Avro data read from Kafka through Spark Streaming

2016-02-25 Thread Mohammad Tariq
Hi group,

I have just started working with confluent platform and spark streaming,
and was wondering if it is possible to access individual fields from an
Avro object read from a kafka topic through spark streaming. As per its
default behaviour *KafkaUtils.createDirectStream[Object, Object,
KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet)*
returns a *DStream[Object, Object]* which doesn't have any schema associated
with *Object* (or I am unable to figure it out). This makes it impossible to
perform some operations on this DStream, for example, converting it to a
Spark DataFrame.

Since *KafkaAvroDecoder *doesn't allow us to have any other Class but
*Object *I think I am going in the wrong direction. Any
pointers/suggestions would be really helpful.

*Versions used :*
confluent-1.0.1
spark-1.6.0-bin-hadoop2.4
Scala code runner version - 2.11.6

And this is the small piece of code I am using :

package org.myorg.scalaexamples

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.SparkContext
import org.apache.avro.mapred.AvroKey
import org.apache.spark.sql.SQLContext
//import org.apache.avro.mapred.AvroValue
import org.apache.spark.streaming.kafka._
import org.apache.spark.storage.StorageLevel
import org.apache.avro.generic.GenericRecord
import org.apache.spark.streaming.dstream.DStream
import io.confluent.kafka.serializers.KafkaAvroDecoder
//import org.apache.hadoop.io.serializer.avro.AvroRecord
//import org.apache.spark.streaming.dstream.ForEachDStream
import org.apache.spark.sql.SQLContext
import org.apache.kafka.common.serialization.Deserializer

object DirectKafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }
    val Array(brokers, topics) = args
    val sparkConf = new SparkConf()
      .setAppName("DirectKafkaWordCount")
      .setMaster("local[1]")
    sparkConf.registerKryoClasses(
      Array(classOf[org.apache.avro.mapred.AvroWrapper[GenericRecord]]))
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "group.id" -> "consumer",
      "zookeeper.connect" -> "localhost:2181",
      "schema.registry.url" -> "http://localhost:8081")
    val messages = KafkaUtils.createDirectStream[Object, Object,
      KafkaAvroDecoder, KafkaAvroDecoder](ssc, kafkaParams, topicsSet)
    messages.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

Thank you so much for your valuable time!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



Spark with proxy

2015-09-08 Thread Mohammad Tariq
Hi friends,

Is it possible to interact with Amazon S3 using Spark via a proxy? This is
what I have been doing :

SparkConf conf = new
SparkConf().setAppName("MyApp").setMaster("local");
JavaSparkContext sparkContext = new JavaSparkContext(conf);
Configuration hadoopConf = sparkContext.hadoopConfiguration();
hadoopConf.set("fs.s3n.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConf.set("fs.s3n.awsAccessKeyId", "***");
hadoopConf.set("fs.s3n.awsSecretAccessKey", "***");
hadoopConf.set("httpclient.proxy-autodetect", "false");
hadoopConf.set("httpclient.proxy-host", "***");
hadoopConf.set("httpclient.proxy-port", "");
SQLContext sqlContext = new SQLContext(sparkContext);

But whenever I try to run it, it says :

java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:625)
at
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:524)
at
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:403)
at
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at
org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)
at
org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:131)
at
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:611)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:446)
at
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326)
at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
at
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1038)
at
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2250)
at
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2179)
at
org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
at
org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
at
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:172)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at org.apache.hadoop.fs.s3native.$Proxy21.retrieveMetadata(Unknown Source)
at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:105)
at com.databricks.spark.avro.AvroRelation.<init>(AvroRelation.scala:60)
at
com.databricks.spark.avro.DefaultSource.createRelation(DefaultSource.scala:41)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:673)

The same proxy works fine with the AWS S3 Java API and the JetS3t API. I
could not find any documentation that explains how to set proxies in a Spark
program. Could someone please point me in the right direction?
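
Two leads worth checking, both assumptions rather than verified fixes: JetS3t
normally picks up the httpclient.proxy-* settings from a jets3t.properties
file on the classpath rather than from the Hadoop configuration, so the keys
above may simply be ignored where they are set; and newer hadoop-aws builds
expose proxy settings for the s3a connector directly. A sketch of the latter,
with placeholder host and port (the fs.s3a.proxy.* keys may not exist in older
releases):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ProxyTest").setMaster("local"))
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.proxy.host", "proxy.example.com")   // placeholder proxy host
hadoopConf.set("fs.s3a.proxy.port", "8080")                // placeholder proxy port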

Many thanks.


[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]



Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Mohammad Tariq
I would really appreciate it if someone could help me with this.

On Monday, June 15, 2015, Mohammad Tariq donta...@gmail.com wrote:

 Hello list,

 The method *insertIntoJDBC(url: String, table: String, overwrite:
 Boolean)* provided by Spark DataFrame allows us to copy a DataFrame into
 a JDBC DB table. Similar functionality is provided by the 
 *createJDBCTable(url:
 String, table: String, allowExisting: Boolean) *method. But if you look
 at the docs it says that *createJDBCTable *runs a *bunch of Insert
 statements* in order to copy the data. While the docs of *insertIntoJDBC 
 *doesn't
 have any such statement.

 Could someone please shed some light on this? How exactly data gets
 inserted using *insertIntoJDBC *method?

 And if it is the same as the *createJDBCTable *method, then what exactly does
 *bunch of Insert statements* mean? What's the criterion for deciding the
 number of *inserts per bunch*? How are these bunches generated?

 *An example* could be creating a DataFrame by reading all the files
 stored in a given directory. If I just do *DataFrame.save()*, it'll
 create the same number of output files as the input files. What'll happen
 in case of *DataFrame.df.insertIntoJDBC()*?

 I'm really sorry to pester you with so many questions, but I could not get
 much help by Googling this.

 Thank you so much for your valuable time. Really appreciate it.

 [image: http://]
 Tariq, Mohammad
 about.me/mti
 [image: http://]
 http://about.me/mti




-- 

[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]
http://about.me/mti


DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-15 Thread Mohammad Tariq
Hello list,

The method *insertIntoJDBC(url: String, table: String, overwrite: Boolean)*
provided by Spark DataFrame allows us to copy a DataFrame into a JDBC DB
table. Similar functionality is provided by the *createJDBCTable(url:
String, table: String, allowExisting: Boolean) *method. But if you look at
the docs it says that *createJDBCTable *runs a *bunch of Insert statements*
in order to copy the data. While the docs of *insertIntoJDBC *doesn't have
any such statement.

Could someone please shed some light on this? How exactly data gets
inserted using *insertIntoJDBC *method?

And if it is the same as the *createJDBCTable *method, then what exactly does
*bunch of Insert statements* mean? What's the criterion for deciding the
number of *inserts per bunch*? How are these bunches generated?

*An example* could be creating a DataFrame by reading all the files stored
in a given directory. If I just do *DataFrame.save()*, it'll create the
same number of output files as the input files. What'll happen in case of
*DataFrame.df.insertIntoJDBC()*?
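
For what it's worth, reading the 1.3/1.4 JDBC writer suggests (an observation,
not documented behaviour) that rows are written partition by partition, each
partition over its own connection, so the parallelism of the write follows the
DataFrame's partitioning. If that holds, repartitioning before the call is one
way to control how many concurrent insert streams hit the database. A sketch
with placeholder URL and table name:

val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret"  // placeholder URL
df.repartition(8)                              // 8 is an arbitrary example value
  .insertIntoJDBC(jdbcUrl, "my_table", false)  // placeholder table name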

I'm really sorry to pester you with so many questions, but I could not get
much help by Googling this.

Thank you so much for your valuable time. Really appreciate it.

[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]
http://about.me/mti


Transactional guarantee while saving DataFrame into a DB

2015-06-02 Thread Mohammad Tariq
Hi list,

With the help of the Spark DataFrame API we can save a DataFrame into a
database table through the insertIntoJDBC() call. However, I could not find
any info about the transactional guarantees it provides. What if my program
gets killed during processing? Would it end up in a partial load?

Is it somehow possible to handle these kinds of scenarios? Rollback or
something of that sort?
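
If the answer turns out to be that there is no built-in guarantee, one common
pattern is to load into a staging table and promote it in a single transaction
on the database side, so a killed job never leaves the live table half-loaded.
A rough sketch with placeholder names:

val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb?user=me&password=secret"  // placeholder URL
df.insertIntoJDBC(jdbcUrl, "my_table_staging", true)   // load the staging table first

// then, over a plain JDBC connection (database-specific SQL, Postgres-style shown):
//   BEGIN;
//   DROP TABLE IF EXISTS my_table;
//   ALTER TABLE my_table_staging RENAME TO my_table;
//   COMMIT;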

Many thanks.

P.S : I am using spark-1.3.1-bin-hadoop2.4 with java 1.7

[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]
http://about.me/mti


Re: Forbidden : Error Code: 403

2015-05-18 Thread Mohammad Tariq
 bytes result sent to driver
15/05/18 17:35:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID
2, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:56 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
15/05/18 17:35:56 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID
1) in 2224 ms on localhost (1/596)
15/05/18 17:35:56 INFO HadoopRDD: Input split:
s3a://bucket-name/avro_data/episodes.avro:2+1
15/05/18 17:35:56 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:56 INFO BlockManager: Removing broadcast 1
15/05/18 17:35:56 INFO BlockManager: Removing block broadcast_1_piece0
15/05/18 17:35:56 INFO MemoryStore: Block broadcast_1_piece0 of size 2386
dropped from memory (free 69905618)
15/05/18 17:35:56 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
localhost:60317 in memory (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:56 INFO BlockManagerMaster: Updated info of block
broadcast_1_piece0
15/05/18 17:35:56 INFO BlockManager: Removing block broadcast_1
15/05/18 17:35:56 INFO MemoryStore: Block broadcast_1 of size 3448 dropped
from memory (free 69909066)
15/05/18 17:35:56 INFO ContextCleaner: Cleaned broadcast 1
15/05/18 17:35:56 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:56 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:56 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:57 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -4
15/05/18 17:35:57 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:58 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -595
15/05/18 17:35:58 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 2
15/05/18 17:35:58 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2).
1800 bytes result sent to driver
15/05/18 17:35:58 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID
3, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:58 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
15/05/18 17:35:58 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID
2) in 2655 ms on localhost (2/596)
15/05/18 17:35:58 INFO HadoopRDD: Input split:
s3a://bucket-name/avro_data/episodes.avro:3+1
15/05/18 17:35:58 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:58 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:58 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:59 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
.
.
.
.

And this is my code :

public static void main(String[] args) {

    System.out.println("START...");
    SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    Configuration config = sc.hadoopConfiguration();
    // FOR s3a
    config.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    config.set("fs.s3a.access.key", "**");
    config.set("fs.s3a.secret.key", "***");
    SQLContext sqlContext = new SQLContext(sc);
    DataFrame df = sqlContext.load("s3a://bucket-name/avro_data/episodes.avro",
        "com.databricks.spark.avro");
    // DataFrame df = sqlContext.load("/Users/miqbal1/avro_data/episodes.avro",
    //     "com.databricks.spark.avro");
    df.show();
    df.printSchema();
    df.select("name").show();
    System.out.println("DONE");
}

The same code works fine with a local file, though. Am I missing
something here? Any help would be highly appreciated.
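
A standalone check that takes Spark out of the picture and exercises only the
s3a connector and the credentials might help narrow this down, along the lines
of the earlier suggestion to test outside Spark. Bucket, path, and keys below
are placeholders:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("fs.s3a.access.key", "****")    // placeholder credentials
conf.set("fs.s3a.secret.key", "****")
val fs = FileSystem.get(new java.net.URI("s3a://bucket-name/"), conf)
fs.listStatus(new Path("s3a://bucket-name/avro_data/")).foreach(s => println(s.getPath))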

Thank you.

[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]
http://about.me/mti


On Sun, May 17, 2015 at 8:51 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 I think you can try this way also:

 DataFrame df = sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro",
 "com.databricks.spark.avro");


 Thanks
 Best Regards

 On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq donta...@gmail.com
 wrote:

 Thanks for the suggestion Steve. I'll try that out.

 Read the long story last night while struggling with this :). I made sure
 that I don't have any '/' in my key.

 On Saturday, May 16, 2015, Steve Loughran ste...@hortonworks.com wrote:


  On 15 May 2015, at 21:20, Mohammad Tariq donta...@gmail.com wrote:
 
  Thank you Ayan and Ted for the prompt response. It isn't working with
 s3n either.
 
  And I am able to download the file. In fact I am able to read the same
 file using s3 API without any issue.
 


 sounds like an S3n config problem. Check your configurations - you can
 test locally via the hdfs dfs command without even starting spark

  Oh, and if there is a / in your secret key, you're going to need
  to generate a new one. Long story

Forbidden : Error Code: 403

2015-05-15 Thread Mohammad Tariq
Hello list,

*Scenario : *I am trying to read an Avro file stored in S3 and create a
DataFrame out of it using the *Spark-Avro* library
(https://github.com/databricks/spark-avro), but I am unable to do so.
This is the code which I am using :

public class S3DataFrame {

    public static void main(String[] args) {

        System.out.println("START...");
        SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        Configuration config = sc.hadoopConfiguration();
        config.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        config.set("fs.s3a.access.key", "***");
        config.set("fs.s3a.secret.key", "***");
        config.set("fs.s3a.endpoint", "s3-us-west-2.amazonaws.com");
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.load("s3a://bucket-name/file.avro",
            "com.databricks.spark.avro");
        df.show();
        df.printSchema();
        df.select("title").show();
        System.out.println("DONE");
        // df.save("/new/dir/", "com.databricks.spark.avro");
    }
}

*Problem :* *Getting Exception in thread main
com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service:
Amazon S3; Status Code: 403; Error Code: 403 Forbidden;*

And this is the complete exception trace :

Exception in thread main
com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service:
Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID:
63A603F1DC6FB900), S3 Extended Request ID:
vh5XhXSVO5ZvhX8c4I3tOWQD/T+B0ZW/MCYzUnuNnQ0R2JoBmJ0MPmUePRiQnPVASTbkonoFPIg=
at
com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1088)
at
com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:735)
at
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:461)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:296)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3743)
at
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1027)
at
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1005)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:688)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:71)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:248)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1623)
at com.databricks.spark.avro.AvroRelation.newReader(AvroRelation.scala:105)
at com.databricks.spark.avro.AvroRelation.<init>(AvroRelation.scala:60)
at
com.databricks.spark.avro.DefaultSource.createRelation(DefaultSource.scala:41)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:673)
at org.myorg.dataframe.S3DataFrame.main(S3DataFrame.java:25)


Would really appreciate some help. Thank you so much for your precious time.

*Software versions used :*
spark-1.3.1-bin-hadoop2.4
hadoop-aws-2.6.0.jar
MAC OS X 10.10.3
java version 1.6.0_65

[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]
http://about.me/mti


Re: Forbidden : Error Code: 403

2015-05-15 Thread Mohammad Tariq
Thanks for the suggestion Steve. I'll try that out.

Read the long story last night while struggling with this :). I made sure
that I don't have any '/' in my key.

On Saturday, May 16, 2015, Steve Loughran ste...@hortonworks.com wrote:


  On 15 May 2015, at 21:20, Mohammad Tariq donta...@gmail.com
 javascript:; wrote:
 
  Thank you Ayan and Ted for the prompt response. It isn't working with
 s3n either.
 
  And I am able to download the file. In fact I am able to read the same
 file using s3 API without any issue.
 


 sounds like an S3n config problem. Check your configurations - you can
 test locally via the hdfs dfs command without even starting spark

  Oh, and if there is a / in your secret key, you're going to need to
  generate a new one. Long story



-- 

[image: http://]
Tariq, Mohammad
about.me/mti
[image: http://]
http://about.me/mti