RE: Hive using Spark engine alone

2015-11-27 Thread Mich Talebzadeh
Sent: 27 November 2015 14:03 To: Mich Talebzadeh <m...@peridale.co.uk> Cc: user <user@spark.apache.org> Subject: Re: Hive using Spark engine alone Hi, I recommend using the latest version of Hive. You may also wait for Hive on Tez with Tez version >= 0.8 and Hive >

FW: Managed to make Hive run on Spark engine

2015-12-07 Thread Mich Talebzadeh
For those interested From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 06 December 2015 20:33 To: u...@hive.apache.org Subject: Managed to make Hive run on Spark engine Thanks all, especially to Xuefu, for the contributions. Finally it works, which means don’t give up until it works

Getting error when trying to start master node after building spark 1.3

2015-12-07 Thread Mich Talebzadeh
Thanks, sorted. Actually I used version 1.3.1 and now I have managed to make it work as the Hive execution engine. Cheers, Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files

RE: spark sql current time stamp function ?

2015-12-07 Thread Mich Talebzadeh
Or try this: cast(from_unixtime(unix_timestamp()) AS timestamp) HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author
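The cast shown turns epoch seconds into a timestamp. A minimal Python sketch of what `from_unixtime(unix_timestamp())` computes, pinned to UTC here for determinism (Hive/Spark actually use the session timezone):

```python
import datetime

def from_unixtime(epoch_seconds, fmt="%Y-%m-%d %H:%M:%S"):
    # Epoch seconds -> formatted timestamp string, like Hive's from_unixtime();
    # fixed to UTC in this sketch rather than the session timezone.
    return datetime.datetime.fromtimestamp(
        epoch_seconds, tz=datetime.timezone.utc
    ).strftime(fmt)

print(from_unixtime(86400))  # one day after the epoch: 1970-01-02 00:00:00
```

In Spark SQL the outer `cast(... AS timestamp)` then converts the formatted string back into a proper timestamp column.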

RE: spark1.4.1 extremely slow for take(1) or head() or first() or show

2015-12-03 Thread Mich Talebzadeh
), otherwise it will have to use disk space. So it boils down to how much memory you have. HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning

Building spark 1.3 from source code to work with Hive 1.2.1

2015-12-03 Thread Mich Talebzadeh
Hi, I have seen mails stating that the user has managed to build Spark 1.3 to work with Hive. I tried Spark 1.5.2 but had no luck. I downloaded the Spark 1.3 source code (spark-1.3.0.tar) and built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

Getting error when trying to start master node after building spark 1.3

2015-12-04 Thread Mich Talebzadeh
Hi, I am trying to make Hive work with Spark. I have been told that I need to use Spark 1.3 and build it from source code WITHOUT HIVE libraries. I have built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

anyone who can help me out with this error please

2015-12-04 Thread Mich Talebzadeh
Hi, I am trying to make Hive work with Spark. I have been told that I need to use Spark 1.3 and build it from source code WITHOUT HIVE libraries. I have built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

has someone seen this error please?

2015-12-04 Thread Mich Talebzadeh
Hi, I am trying to make Hive work with Spark. I have been told that I need to use Spark 1.3 and build it from source code WITHOUT HIVE libraries. I have built it as follows: ./make-distribution.sh --name "hadoop2-without-hive" --tgz

RE: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Mich Talebzadeh
and try again Thanks, Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase A

RE: Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Mich Talebzadeh
Thanks, I tried all :( I am trying to make Hive use Spark, and apparently Hive can use version 1.3 of Spark as its execution engine. Frankly I don’t know why this is not working! Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial

Any clue on this error, Exception in thread "main" java.lang.NoSuchFieldError: SPARK_RPC_CLIENT_CONNECT_TIMEOUT

2015-12-03 Thread Mich Talebzadeh
ClientImpl:at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) 15/12/03 17:53:19 [stderr-redir-1]: INFO client.SparkClientImpl:at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Any clues? Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Runnin

RE: In yarn-client mode, is it the driver or application master that issue commands to executors?

2015-11-27 Thread Mich Talebzadeh
). There will be a job scheduler and one or more Spark Executors depending on the cluster. So as far as I can see both diagrams are correct, HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 <http://login.sybase.com/fi

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: 25 November 2015 22:35 To: Mich Talebzadeh <m...@peridale.co.uk> Cc

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
/1312M [INFO] Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

RE: Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Rule 0: org.apache.maven.plugins.enforcer.RequireMavenVersion failed with message: Detected Maven Version: 3.3.1 is not in the allowed range 3.3.3. Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 http://login.sybase.com

Building Spark without hive libraries

2015-11-25 Thread Mich Talebzadeh
Hi, I am trying to build Spark from source without Hive. I am getting [error] Required file not found: scala-compiler-2.10.4.jar [error] See zinc -help for information about locating necessary files I have to run this as root, otherwise the build does not progress. Any help is

starting start-master.sh throws "java.lang.ClassNotFoundException: org.slf4j.Logger" error

2015-11-26 Thread Mich Talebzadeh
.loadClass(ClassLoader.java:357) ... 6 more Although I have added to the CLASSPATH. Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strateg

Hive using Spark engine alone

2015-11-27 Thread Mich Talebzadeh
. The primary reason I want to use Hive on Spark engine is for performance. Thanks, Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091

FW: starting spark-shell throws /tmp/hive on HDFS should be writable error

2015-11-20 Thread Mich Talebzadeh
From: Mich Talebzadeh [mailto:m...@peridale.co.uk] Sent: 20 November 2015 21:14 To: u...@hive.apache.org Subject: starting spark-shell throws /tmp/hive on HDFS should be writable error Hi, Has this been resolved? I don't think this has anything to do with /tmp/hive directory permission

Re: Stream reading from database using spark streaming

2016-06-02 Thread Mich Talebzadeh
rvers" -> "rhes564:9092", "schema.registry.url" -> "http://rhes564:8081", "zookeeper.connect" -> "rhes564:2181", "group.id" -> "CEP_streaming_with_JDBC" ) val topics = Set("newtopic") val dstream =

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Mich Talebzadeh
Hang on, are you saving this as a new table? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Mich Talebzadeh
OK, what is the new column called? You are basically adding a new column to an already existing table. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: how to increase threads per executor

2016-06-02 Thread Mich Talebzadeh
Interesting, a VM with one core! One simple test: can you try running with --executor-cores=1 and see if it works OK please? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread Mich Talebzadeh
like a connection is left open but cannot establish why! Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Mich Talebzadeh
What version of Spark are you using? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 3 June 2016

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Mich Talebzadeh
by dt in no time. Now what I don't understand is whether that table is already partitioned, as you said the table already exists! Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
/Hadoop/slaves HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 5 June 2016 at 10:50, Marco Cap

Re: Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread Mich Talebzadeh
sure, I am trying to use SparkContext.setCheckpointDir(directory: String) to set it up. I agree that once one starts creating subdirectories like "~/checkpoints/${APPLICATION_NAME}/${USERNAME}" it becomes a bit messy cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/pr

Re: Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread Mich Talebzadeh

Re: Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread Mich Talebzadeh
} } I need to change one of these. Actually a better alternative would be for each application to have its own checkpoint directory? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gB
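The per-application layout suggested here can be sketched as a small path-building helper; the base directory, user, and application names are taken from the thread and are illustrative:

```python
from pathlib import PurePosixPath

def checkpoint_dir(base, user, app_name):
    # One checkpoint directory per user and application, so concurrent
    # streaming jobs never share (and clobber) the same checkpoint files.
    return str(PurePosixPath(base) / user / app_name)

print(checkpoint_dir("/user/hduser/checkpoint", "hduser", "TwitterAnalyzer"))
```

The resulting path would then be passed to SparkContext.setCheckpointDir before starting the streaming context.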

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh
I use YARN as I run Hive on Spark engine in yarn-cluster mode, plus other stuff. If I turn off YARN, half of my applications won't work. I don't see great concern in supporting YARN. However you may have other reasons Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Does Spark uses data locality information from HDFS when running in standalone mode?

2016-06-05 Thread Mich Talebzadeh

Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
ask 0 in stage 1.0 failed 1 times; aborting job Suggested solution. In a concurrent env, Spark should apply locks in order to prevent such operations. Locks are kept in Hive meta data table HIVE_LOCKS HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
table is locked as SHARED_READ
2. With Spark --> No locks at all
3. With HIVE --> No locks on the target table
4. With Spark --> No locks at all
HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https:

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
issue here HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 8 June 2016 at 22:36, Michael

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
ive there is the issue with DDL + DML locks applied in a single transaction i.e. --> create table A as select * from b HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profil

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
check port 8080 on the node that you started start-master.sh [image: Inline images 2] HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Mich Talebzadeh
that Spark assumes no concurrency for Hive table. It is probably the same reason why updates/deletes to Hive ORC transactional tables through Spark fail. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
if they are down? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 9 June 2016 at 01:27,

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
I assume your zookeeper is up and running. Can you confirm that you are getting topics from Kafka independently, for example on the command line: ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181 --from-beginning --topic newtopic Dr Mich Talebzadeh LinkedIn * https

Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
ool/language to dig in to that data. For example twitter streaming data. I am getting all sorts of stuff coming in. Say I am only interested in certain topics like sport etc. How can I detect the signal from the noise, using what tool and language? Thanks Dr Mich Talebzadeh LinkedIn * ht

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
ou are aggregating data that you are collecting over the batch Window val countByValueAndWindow = price.filter(_ > 95.0).countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval)) countByValueAndWindow.print() // ssc.start() ssc.awaitTermination() HTH Dr Mich Talebzadeh LinkedIn * https://w
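The countByValueAndWindow call in the snippet counts the values seen across a sliding window of micro-batches. A stdlib Python sketch of those semantics, assuming hypothetical batch contents and a window/slide measured in whole batches (Spark itself maintains this incrementally over a DStream):

```python
from collections import Counter, deque

def count_by_value_and_window(batches, window_len, slide):
    # Keep the last `window_len` micro-batches; every `slide` batches,
    # emit the value counts across the whole window.
    results = []
    window = deque(maxlen=window_len)
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide == 0:
            results.append(dict(Counter(v for b in window for v in b)))
    return results

print(count_by_value_and_window([[96], [97], [96], [98]], 2, 2))
```

Each emitted dict corresponds to one output of countByValueAndWindow at the end of a slide interval.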

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
*/java HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 7 June 2016 at 11:59, Dominik S

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
and the output below at the same time, running to see the exact cause of it ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181 --from-beginning --topic newtopic Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <ht

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
by default the driver will start where you have started sbin/start-master.sh, that is, where you start your app via SparkSubmit. The slaves have to have an entry in the slaves file. What is the issue here? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-06 Thread Mich Talebzadeh
} \ ${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \ ${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \ Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: Sequential computation over several partitions

2016-06-07 Thread Mich Talebzadeh
? the issue I believe you may face is that as you go from t0 -> t1 -> tn your volume of data is going to rise. How about periodic storage of your analysis and working on deltas only afterwards? What sort of data is it? Is it typical web-users? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/prof
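The periodic-storage-plus-deltas idea can be sketched with a high-water mark: persist results up to some timestamp, then process only records newer than it. The record shape used below is hypothetical:

```python
def take_delta(records, last_ts):
    # Keep only records newer than the last high-water mark and advance it;
    # the {"ts": ..., "v": ...} record shape is illustrative.
    fresh = [r for r in records if r["ts"] > last_ts]
    new_mark = max((r["ts"] for r in fresh), default=last_ts)
    return fresh, new_mark

store = [{"ts": 1, "v": "a"}, {"ts": 2, "v": "b"}]
fresh, mark = take_delta(store, 0)        # first pass: everything is new
store.append({"ts": 3, "v": "c"})
fresh2, mark2 = take_delta(store, mark)   # second pass: only ts=3 is new
print(mark, [r["ts"] for r in fresh2], mark2)
```

The high-water mark would itself be persisted between runs, so each pass only touches the delta rather than the ever-growing t0..tn history.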

Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
Interesting. There is also apache nifi <https://nifi.apache.org/> Also I note that one can store twitter data in Hive tables as well? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profil

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
probability much more powerful than other nodes. Also the node running the resource manager is also running one of the node managers as well. So in theory maybe, in practice may not? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
all resources in use all the time. However, resource manager itself is on the resource manager node. Now I always start my Spark app on the same node as the resource manager node and let Yarn take care of the rest. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
"indexing") via JSON, XML, CSV or binary over HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary results. thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/pr

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
a typical question. You mentioned Spark ml (machine learning?). Is that something viable? Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
thanks I will have a look. Mich Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 7 June 2016 at

Re: Spark_Usecase

2016-06-07 Thread Mich Talebzadeh
I use Spark rather than Sqoop to import data from an Oracle table into a Hive ORC table. It uses JDBC for this purpose, all inclusive in Scala itself. Also Hive runs on the Spark engine, an order of magnitude faster than on map-reduce. Pretty simple. HTH Dr Mich Talebzadeh LinkedIn

Re: RESOLVED - Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
OK so this was Kafka issue? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 7 June 2016 at

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
to make much difference. It sounds like yarn-cluster supersedes yarn-client? Any comments welcome Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6z

What is the interpretation of Cores in Spark doc

2016-06-12 Thread Mich Talebzadeh
trust that I am not nitpicking here! Cheers, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com

Re: Questions about Spark Worker

2016-06-12 Thread Mich Talebzadeh
Hi, You basically want to use wired/Ethernet connections as opposed to wireless? In your Spark Web UI, under the Environment tab, what do you get for "spark.driver.host"? Also can you cat /etc/hosts and send the output please, and the output from ifconfig -a. HTH Dr Mich Talebzadeh
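A small stdlib script can gather the hostname-resolution half of this check (the spark.driver.host value itself still has to be read off the Spark UI):

```python
import socket

def host_report():
    # Collect what the /etc/hosts and ifconfig questions are probing:
    # does the machine's own hostname resolve, and to which addresses?
    hostname = socket.gethostname()
    try:
        name, aliases, addrs = socket.gethostbyname_ex(hostname)
    except socket.gaierror:
        name, aliases, addrs = hostname, [], []
    return {"hostname": hostname, "resolved_name": name,
            "aliases": aliases, "addresses": addrs}

print(host_report())
```

If the addresses list shows only a loopback entry while spark.driver.host shows an interface address (or vice versa), that mismatch is a common cause of executors failing to reach the driver.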

Re: Long Running Spark Streaming getting slower

2016-06-10 Thread Mich Talebzadeh
Hi John, I did not notice anything unusual in your env variables. However, what are the batch interval, the windowLength and the slidingInterval? Also how many messages are sent by Kafka in a typical batch interval? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile

Re: Long Running Spark Streaming getting slower

2016-06-10 Thread Mich Talebzadeh
is the nature of this spark streaming if you can divulge on it? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Mich Talebzadeh
not make it worthwhile. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 10 June 2016 at

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Mich Talebzadeh
etween a pair of "" will be interpreted as text, NOT a column name. In Spark SQL you do not need double quotes. So simply:
spark-sql> select prod_id, cust_id from sales limit 2;
17 28017
18 10419
HTH Dr Mich Talebzadeh LinkedIn * https://www.li
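The quoting point can be demonstrated with stdlib sqlite3, where single quotes play the role that double quotes play in Hive's SQL dialect: a quoted value is a string literal, not a column reference. The table and data below are illustrative, mirroring the sales example in the thread:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (prod_id INT, cust_id INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)", [(17, 28017), (18, 10419)])

# A quoted 'prod_id' is a string literal: every row yields the text itself.
literal = con.execute("SELECT 'prod_id' FROM sales LIMIT 1").fetchone()
# An unquoted prod_id is a column reference: rows yield the stored values.
column = con.execute(
    "SELECT prod_id FROM sales ORDER BY prod_id LIMIT 1"
).fetchone()
print(literal, column)
```

Selecting the quoted name returns the literal text for each row, which is exactly the surprising behaviour the SAS user was hitting against Spark SQL.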

Re: SAS_TO_SPARK_SQL_(Could be a Bug?)

2016-06-10 Thread Mich Talebzadeh
t;, "orc.row.index.stride"="1" ) """ HiveContext.sql(sqltext) // // Put data in Hive table. Clean up is already done // sqltext = """ INSERT INTO TABLE oraclehadoop.dummy SELECT ID , CLUSTERED , S

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-09 Thread Mich Talebzadeh
Hi Rutuja, I am not certain whether such a tool exists or not. However, opening a JIRA may be beneficial and would not do any harm. You may look for a workaround. Now my understanding is that your need is for monitoring the health of the cluster? HTH Dr Mich Talebzadeh LinkedIn * https

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Mich Talebzadeh
how are you doing the insert? from an existing table? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpre

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Mich Talebzadeh
w to deduce if there was indeed spillage to disk by Spark, see (TungstenAggregate) HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUr

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Mich Talebzadeh
Can you provide a code snippet of how you are populating the target table from the temp table? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Mich Talebzadeh
are you using map-reduce with Hive? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 9 June 2016

Re: Twitter streaming error : No lease on /user/hduser/checkpoint/temp (inode 806125): File does not exist.

2016-06-03 Thread Mich Talebzadeh
0 2016-06-03 23:38 /user/hduser/checkpoint/TwitterAnalyzer$/receivedBlockMetadata
-rw-r--r-- 2 hduser supergroup 5199 2016-06-03 23:39 /user/hduser/checkpoint/TwitterAnalyzer$/temp
It works fine. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin

twitter data analysis

2016-06-03 Thread Mich Talebzadeh
I know this question may not be directly relevant, but what are the main approaches: real-time analysis of twitter data using Spark Streaming, or storing the data in HDFS and using it later? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view

Re: how to increase threads per executor

2016-06-03 Thread Mich Talebzadeh
E} \ ${OUTPUT_DIRECTORY:-/tmp/tweets} \ ${NUM_TWEETS_TO_COLLECT:-10000} \ ${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \ ${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \ >> ${LOG_FILE} Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEW

Re: twitter data analysis

2016-06-03 Thread Mich Talebzadeh
om> wrote: > Or combine both! It is possible with Spark Streaming to combine streaming > data and on HDFS. In the end it always depends what you want to do and when > you need what. > > On 03 Jun 2016, at 10:26, Mich Talebzadeh <mich.talebza...@gmail.com> > wrote:

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Mich Talebzadeh
. Unlike Local or Spark standalone modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn HTH Dr Mich Talebzadeh LinkedIn * https
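The rule described here (the master address is explicit for local and standalone modes, but resolved from the Hadoop configuration under YARN) can be summarised in a small helper. The function and defaults are hypothetical, mirroring the --master values spark-submit accepts:

```python
def master_url(mode, host=None, port=7077):
    # Build the --master value spark-submit expects for each deploy mode.
    if mode == "local":
        return "local[*]"                # all cores on the local machine
    if mode == "standalone":
        return f"spark://{host}:{port}"  # Spark's own cluster manager
    if mode == "yarn":
        # No address needed: the ResourceManager location is read from
        # the Hadoop configuration (HADOOP_CONF_DIR / YARN_CONF_DIR).
        return "yarn"
    raise ValueError(f"unknown mode: {mode}")

print(master_url("standalone", host="rhes564"))
```

The yarn case returning a bare "yarn" is precisely the point of the snippet: the cluster address lives in the Hadoop config, not on the command line.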

Re: Running Spark in Standalone or local modes

2016-06-11 Thread Mich Talebzadeh
st and your best hope is using all the available cores. Hence in summary by using Spark in standalone mode (actually this terminology is a bit misleading, it would be better if they called it Spark Own Scheduler Mode (OSM)), you will have better performance due to clustering nature of Spark. HTH Dr Mi

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Mich Talebzadeh
yes absolutely Ted. Thanks for highlighting it Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com

Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Mich Talebzadeh
(and these are pretty well respected when it comes to Spark) and progress from there. If you have a certain problem then put it to this group and I am sure someone somewhere in this forum has come across it. Also most of these books' authors actively contribute to this mailing list. HTH Dr Mich Talebzadeh

Re: Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Mich Talebzadeh
it is good to be in control :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 14 June 2016

Re: Create external table with partitions using sqlContext.createExternalTable

2016-06-14 Thread Mich Talebzadeh
t;="0.05", "orc.stripe.size"="268435456", "orc.row.index.stride"="1" ) """ sql(sqltext) sql("select count(1) from test.orctype").show res2: org.apache.spark.sql.DataFrame = [result: string] +--

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Mich Talebzadeh
In all probability there is no user database created in Hive. Create a database yourself: sql("create database if not exists test") It would be helpful if you grasped some concepts of Hive databases etc. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profi

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-14 Thread Mich Talebzadeh
Hi Swetha, Have you actually tried doing this in Hive using Hive CLI or beeline? Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: can not show all data for this table

2016-06-14 Thread Mich Talebzadeh
ption, Value, Balance, AccountName, AccountNumber from tmp").take(2) replace those with your column names. they are mapped using case class HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/pr

Re: can not show all data for this table

2016-06-15 Thread Mich Talebzadeh
at last some progress :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 15 June 2016 at 10:5

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Mich Talebzadeh
Have you looked at the Spark GUI to see what it is waiting for? Is it available memory? What resource manager are you using? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-04 Thread Mich Talebzadeh
--conf "spark.ui.port=4040" \ ${JAR_FILE} The Spark GUI port is 4040 (the default). Just track the progress of the job. You can specify your own port by replacing 4040 with a non-used port value. Try it anyway. HTH Dr Mich Talebzadeh LinkedIn *

Re: Analyzing twitter data

2016-06-08 Thread Mich Talebzadeh
[Sample of raw tweet "source" field values received: "Twitter for iPhone", "Twitter for Android", "Buffer", plus promotional spam (Lyft promo codes) and non-English tweets, e.g. a Japanese Love Live! fan profile asking for follows.]

Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
memory/heap/cpu etc HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 29 May 2016 at 00:26

Re: join function in a loop

2016-05-28 Thread Mich Talebzadeh
You are welcome. Also you can use the OS command /usr/bin/free to see how much free memory you have on each node. You should also see from the Spark GUI (first job on master node:4040, the next on 4041, etc.) the resources and storage (memory usage) for each SparkSubmit job. HTH Dr Mich Talebzadeh

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
to NOT use local mode in prod. Others may have different opinions on this. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
is to try running the first one. Check the Web GUI on 4040 to see the progress of this job. If you start the next JVM, then assuming it is working, it will be using port 4041, and so forth. In actual fact, try the command "free" to see how much free memory you have. HTH Dr Mich Talebzadeh

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, that is good news. So briefly, how do you kick off spark-submit for each (or SparkConf), in terms of memory/resource allocations? Now, what is the output of /usr/bin/free? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 28 May 2016 at 17:41, Ted Yu <yuzhih...@gmail.c

Re: local Vs Standalonecluster production deployment

2016-05-28 Thread Mich Talebzadeh
OK, they are submitted, but the latter one (14302), is it doing anything? Can you check it with jmonitor or the logs created? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: JDBC Cluster

2016-05-30 Thread Mich Talebzadeh
GUI [image: Inline images 1] Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 30 May 2016 at 10:1

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-29 Thread Mich Talebzadeh
but of course Spark has both, plus in-memory capability. It would be interesting to see what version of TEZ works as the execution engine with Hive. Vendors are divided on this (use Hive with TEZ, or use Impala instead of Hive, etc.), as I am sure you already know. Cheers, Dr Mich Talebzadeh LinkedIn

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Mich Talebzadeh
data). The 80-20 rule? In reality it may be just 2TB, or the most recent partitions, etc. The rest is cold data. Cheers, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: equvalent beewn join sql and data frame

2016-05-30 Thread Mich Talebzadeh
"amount_sold").as("TotalSales")) val rs = s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales")) HTH Dr Mich Talebzadeh LinkedIn * https://w
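The join/groupBy/agg pipeline in this snippet can be illustrated with plain Scala collections. This is only a sketch of the computation the DataFrame code performs (the sample rows, month and channel names are invented), not the Spark API itself:

```scala
// Hypothetical rows mirroring the sales (s), times (t) and channels (c) tables.
case class Sale(timeId: Int, channelId: Int, amountSold: Double)
case class Time(timeId: Int, calendarMonthDesc: String)
case class Channel(channelId: Int, channelDesc: String)

val sales    = Seq(Sale(1, 10, 100.0), Sale(1, 10, 50.0), Sale(2, 20, 75.0))
val times    = Map(1 -> Time(1, "2016-05"), 2 -> Time(2, "2016-06"))
val channels = Map(10 -> Channel(10, "Internet"), 20 -> Channel(20, "Direct"))

// Join on time_id and channel_id, then group by (month, channel) and sum amount_sold:
// the same result that groupBy("calendar_month_desc","channel_desc")
// .agg(sum("amount_sold").as("TotalSales")) produces on DataFrames.
val totalSales = sales
  .groupBy(s => (times(s.timeId).calendarMonthDesc, channels(s.channelId).channelDesc))
  .map { case (key, rows) => key -> rows.map(_.amountSold).sum }

println(totalSales)
```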

Re: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Hi Teng, what version of Spark are you using as the execution engine? Are you using a vendor's product here? Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view

Re: JDBC Create Table

2016-05-27 Thread Mich Talebzadeh
Are you using JDBC in the Spark shell? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* http://talebzadehmich.wordpress.com On 27 May 2016

Re: Pros and Cons

2016-05-27 Thread Mich Talebzadeh
Hi Ted, do you mean Hive 2 with a Spark 2 snapshot build as the execution engine, just binaries for the snapshot (all OK)? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw <https://www.linkedin.com/profile/view
