Re: Merging Parquet Files

2020-09-03 Thread Michael Segel
Hi, I think you’re asking the right question, however you’re making an assumption that he’s on the cloud and he never talked about the size of the file. It could be that he’s got a lot of small-ish data sets. 1GB is kinda small in relative terms. Again YMMV. Personally if you’re going t
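The sizing argument above can be sketched as back-of-the-envelope arithmetic. This is a hypothetical illustration, not the poster's method: the file counts and the 1 GB target are invented, and the `df.coalesce(n).write.parquet(...)` call in the comment is only one way to act on the result.

```python
import math

# Hedged sketch: pick a coalesce() partition count so merged Parquet files
# land near a target size. All numbers here are invented; real output sizes
# vary with encoding and compression.
total_input_mb = 10 * 1024      # e.g. many small-ish files totalling ~10 GB
target_file_mb = 1024           # ~1 GB per output file, per the discussion

num_output_files = max(1, math.ceil(total_input_mb / target_file_mb))
print(num_output_files)  # 10 -> roughly df.coalesce(10).write.parquet(...)
```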

Re: schema change for structured spark streaming using jsonl files

2018-04-25 Thread Michael Segel
Hi, This is going to sound complicated. Taken as an individual JSON document, because it’s a self-contained schema doc, it’s structured. However there isn’t a persistent schema that has to be consistent across multiple documents. So you can consider it semi-structured. If you’re parsing the JS

Re: Reading Hive RCFiles?

2018-01-29 Thread Michael Segel
2018 at 5:02 PM, Michael Segel mailto:msegel_had...@hotmail.com>> wrote: No idea on how that last line of garbage got in the message. > On Jan 18, 2018, at 9:32 AM, Michael Segel > mailto:msegel_had...@hotmail.com>> wrote: > > Hi, > > I’m trying to find out if th

Re: Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
No idea on how that last line of garbage got in the message. > On Jan 18, 2018, at 9:32 AM, Michael Segel wrote: > > Hi, > > I’m trying to find out if there’s a simple way for Spark to be able to read > an RCFile. > > I know I can create a table in Hive, then d

Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
Hi, I’m trying to find out if there’s a simple way for Spark to be able to read an RCFile. I know I can create a table in Hive, then drop the files in to that directory and use a sql context to read the file from Hive, however I wanted to read the file directly. Not a lot of details to go

Apache Spark documentation on mllib's Kmeans doesn't jibe.

2017-12-13 Thread Michael Segel
Hi, Just came across this while looking at the docs on how to use Spark’s Kmeans clustering. Note: This appears to be true in both 2.1 and 2.2 documentation. The overview page: https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means Here the example contains the following line: val

Re: [Spark Context]: How to add on demand jobs to an existing spark context?

2017-02-07 Thread Michael Segel
Why couldn’t you use the spark thrift server? On Feb 7, 2017, at 1:28 PM, Cosmin Posteuca mailto:cosmin.poste...@gmail.com>> wrote: answer for Gourav Sengupta I want to use same spark application because i want to work as a FIFO scheduler. My problem is that i have many jobs(not so big) and i

Quick but probably silly question...

2017-01-17 Thread Michael Segel
Hi, While the parquet file is immutable and the data sets are immutable, how does sparkSQL handle updates or deletes? I mean if I read in a file using SQL in to an RDD, mutate it, eg delete a row, and then persist it, I now have two files. If I reread the table back in … will I see duplicates

Re: Spark/Parquet/Statistics question

2017-01-17 Thread Michael Segel
Hi, Lexicographically speaking, Min/Max should work because String(s) support a comparator operator. So anything which supports an equality test (<,>, <= , >= , == …) can also support min and max functions as well. I guess the question is if Spark does support this, and if not, why? Yes, it
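The lexicographic min/max reasoning above can be shown in a few lines. This is a conceptual sketch in plain Python, not Parquet's or Spark's actual implementation; the row-group values and the `can_skip` helper are made up for illustration.

```python
# Parquet keeps per-row-group min/max statistics; for strings these are
# lexicographic, because strings support comparison operators.
row_group_names = ["alice", "bob", "carol"]
stats = {"min": min(row_group_names), "max": max(row_group_names)}

def can_skip(stats, predicate_value):
    """A reader can skip a whole row group when the predicate value falls
    outside [min, max] -- the basic idea behind predicate pushdown."""
    return predicate_value < stats["min"] or predicate_value > stats["max"]

print(stats)                   # {'min': 'alice', 'max': 'carol'}
print(can_skip(stats, "zed"))  # True: 'zed' > 'carol', row group skippable
print(can_skip(stats, "bob"))  # False: the row group must be read
```

Whether a given engine actually consults these statistics for string columns is exactly the question the post raises.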

Re: importing data into hdfs/spark using Informatica ETL tool

2016-11-09 Thread Michael Segel
Oozie, a product only a mad Russian would love. ;-) Just say no to hive. Go from Flat to Parquet. (This sounds easy, but there’s some work that has to occur…) Sorry for being cryptic, Mich’s question is pretty much generic for anyone building a data lake so it ends up overlapping with some work

Re: Save a spark RDD to disk

2016-11-09 Thread Michael Segel
Can you increase the number of partitions and also increase the number of executors? (This should improve the parallelization but you may become disk i/o bound) On Nov 8, 2016, at 4:08 PM, Elf Of Lothlorein mailto:redarro...@gmail.com>> wrote: Hi I am trying to save a RDD to disk and I am using

Re: sandboxing spark executors

2016-11-08 Thread Michael Segel
Not that easy of a problem to solve… Can you impersonate the user who provided the code? I mean if Joe provides the lambda function, then it runs as Joe so it has joe’s permissions. Steve is right, you’d have to get down to your cluster’s security and authenticate the user before accepting

Re: Spark Streaming backpressure weird behavior/bug

2016-11-07 Thread Michael Segel
Spark inherits its security from the underlying mechanisms in either YARN or MESOS (whichever environment you are launching your cluster/jobs) That said… there is limited support from Ranger. There are three parts to this… 1) Ranger being called when the job is launched… 2) Ranger being calle

How sensitive is Spark to Swap?

2016-11-07 Thread Michael Segel
This may seem like a silly question, but it really isn’t. In terms of Map/Reduce, its possible to over subscribe the cluster because there is a lack of sensitivity if the servers swap memory to disk. In terms of HBase, which is very sensitive, swap doesn’t just kill performance, but also can k

Re: Quirk in how Spark DF handles JSON input records?

2016-11-03 Thread Michael Segel
", x, flags=re.UNICODE)) // convert the rdd to dataframe. If you have your own schema, this is where you should add it. df = spark.read.json(js) Assaf. From: Michael Segel [mailto:msegel_had...@hotmail.com] Sent: Wednesday, November 02, 2016 9:39 PM To: Daniel Siegmann Cc: user @spark Subje

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
On Nov 2, 2016, at 2:22 PM, Daniel Siegmann mailto:dsiegm...@securityscorecard.io>> wrote: Yes, it needs to be on a single line. Spark (or Hadoop really) treats newlines as a record separator by default. While it is possible to use a different string as a record separator, what would you use i

Re: Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
ARGH!! Looks like a formatting issue. Spark doesn’t like ‘pretty’ output. So then the entire record which defines the schema has to be a single line? Really? On Nov 2, 2016, at 1:50 PM, Michael Segel mailto:msegel_had...@hotmail.com>> wrote: This may be a silly mistake on my part… Do
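The newline-as-record-separator behavior described in this thread can be demonstrated with the standard library alone. A hedged sketch, not Spark's actual reader: the sample records are invented, and `parse_lines` only mimics a line-oriented JSON source.

```python
import json

# With newline as the record separator, each line must be a complete JSON
# document. "Pretty" output spreads one document over several lines, and
# every one of those lines is an invalid fragment on its own.
pretty = '{\n  "id": 1,\n  "type": "THEFT"\n}'        # one record, four lines
jsonl  = '{"id": 1, "type": "THEFT"}\n{"id": 2, "type": "BATTERY"}'

def parse_lines(text):
    records, bad = [], 0
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1  # a fragment of a pretty-printed document
    return records, bad

print(parse_lines(pretty))  # ([], 4) -- no record survives
print(parse_lines(jsonl))   # two clean records, zero failures
```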

Quirk in how Spark DF handles JSON input records?

2016-11-02 Thread Michael Segel
This may be a silly mistake on my part… Doing an example using Chicago’s Crime data.. (There’s a lot of it going around. ;-) The goal is to read a file containing a JSON record that describes the crime data.csv for ingestion into a data frame, then I want to output to a Parquet file. (Pretty s

Re: spark with kerberos

2016-10-18 Thread Michael Segel
Loughran mailto:ste...@hortonworks.com>> wrote: On 17 Oct 2016, at 22:11, Michael Segel mailto:michael_se...@hotmail.com>> wrote: @Steve you are going to have to explain what you mean by ‘turn Kerberos on’. Taken one way… it could mean making cluster B secure and running Kerber

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
ss. On 17 Oct 2016, at 23:02, Michael Segel mailto:msegel_had...@hotmail.com>> wrote: You really don’t want to do OLTP on a distributed NoSQL engine. Remember Big Data isn’t relational its more of a hierarchy model or record model. Think IMS or Pick (Dick Pick’s revelation, U2, Universe, e

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
's and cons about the use of it. On 18 Oct 2016 03:17, "Michael Segel" mailto:msegel_had...@hotmail.com>> wrote: Guys, Sorry for jumping in late to the game… If memory serves (which may not be a good thing…) : You can use HiveServer2 as a connection point to HBase. While t

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
kowski mailto:vincent.gromakow...@gmail.com>> Date: Monday, October 17, 2016 at 1:53 PM To: Benjamin Kim mailto:bbuil...@gmail.com>> Cc: Michael Segel mailto:msegel_had...@hotmail.com>>, Jörn Franke mailto:jornfra...@gmail.com>>, Mich Talebzadeh mailto:mich.talebz

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
store... It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job. 2016-10-17 18:17 GMT+02:00 Michael Segel mailto:msegel_had...@hotmail.com>>:

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
.. It's not part of the public API and I don't know yet what are the issues doing this but I think Spark community should look at this path: making the thriftserver be instantiable in any spark job. 2016-10-17 18:17 GMT+02:00 Michael Segel mailto:msegel_had...@hotmail.com>>: Guys, S

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Michael Segel
Mitch, Short answer… no, it doesn’t scale. Longer answer… You are using an UUID as the row key? Why? (My guess is that you want to avoid hot spotting) So you’re going to have to pull in all of the data… meaning a full table scan… and then perform a sort order transformation, dropping the UU

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi, Apologies if I’ve asked this question before but I didn’t see it in the list and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake… Posting this to both user and dev because I think the question / topic jumps in to both camps. A

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Guys, Sorry for jumping in late to the game… If memory serves (which may not be a good thing…) : You can use HiveServer2 as a connection point to HBase. While this doesn’t perform well, its probably the cleanest solution. I’m not keen on Phoenix… wouldn’t recommend it…. The issue is that you’re

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
pointless. > > On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel > wrote: >> Spark standalone is not Yarn… or secure for that matter… ;-) >> >>> On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote: >>> >>> Spark streaming helps with aggregation beca

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
of sync, leading to lost / > duplicate data. > > Regarding long running spark jobs, I have streaming jobs in the > standalone manager that have been running for 6 months or more. > > On Thu, Sep 29, 2016 at 11:01 AM, Michael Segel > wrote: >> Ok… so what’s the tricky part?

Fwd: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treating NaN fields in Spark)

2016-09-29 Thread Michael Segel
Hi, Hate to be a pain… but could someone remove this email address (see below) from the spark mailing list(s) It seems that ‘Elvis’ has left the building and forgot to change his mail subscriptions… Begin forwarded message: From: Yahoo! No Reply mailto:postmas...@yahoo-inc.com>> Subject: tod..

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work. The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources. (How well does Y

Re: Treating NaN fields in Spark

2016-09-29 Thread Michael Segel
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages ari

Re: Treating NaN fields in Spark

2016-09-29 Thread Michael Segel
Hi, Just a few thoughts so take it for what its worth… Databases have static schemas and will reject a row’s column on insert. In your case… you have one data set where you have a column which is supposed to be a number but you have it as a string. You want to convert this to a double in your f
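The string-to-double conversion the post describes needs a policy for bad values rather than a hard failure. A minimal sketch, with invented sample data; `to_double` is a hypothetical helper, and returning NaN (rather than None, or rejecting the row as a database would) is just one of the choices the thread weighs.

```python
# A "numeric" column arrives as strings; decide up front what a
# non-parseable value becomes instead of letting the job blow up.
def to_double(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")   # or None, if you'd rather filter later

raw = ["3.14", "10", "oops", None]
print([to_double(v) for v in raw])  # [3.14, 10.0, nan, nan]
```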

Re: Spark Hive Rejection

2016-09-29 Thread Michael Segel
Correct me if I’m wrong but isn’t hive schema on read and not on write? So you shouldn’t fail on write. On Sep 29, 2016, at 1:25 AM, Mostafa Alaa Mohamed mailto:mohamedamost...@etisalat.ae>> wrote: Dears, I want to ask • What will happened if there are rejections rows when inserting da

Re: building runnable distribution from source

2016-09-29 Thread Michael Segel
You may want to replace the 2.4 with a later release. On Sep 29, 2016, at 3:08 AM, AssafMendelson mailto:assaf.mendel...@rsa.com>> wrote: Hi, I am trying to compile the latest branch of spark in order to try out some code I wanted to contribute. I was looking at the instructions to build from

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
er processes > fail than grind everything to a halt. You'd buy more memory or > optimize memory before trading it for I/O. > > On Thu, Sep 22, 2016 at 6:29 PM, Michael Segel > wrote: >> Ok… gotcha… wasn’t sure that YARN just looked at the heap size allocation >> and i

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
at 9:56 AM, Sean Owen wrote: > > It's looking at the whole process's memory usage, and doesn't care > whether the memory is used by the heap or not within the JVM. Of > course, allocating memory off-heap still counts against you at the OS > level. > > On Thu,

Re: Off Heap (Tungsten) Memory Usage / Management ?

2016-09-22 Thread Michael Segel
ameter of the JVM. This can >> be configured via spark options for yarn (be aware that they are different >> in cluster and client mode), but i recommend to use the spark options for >> the off heap maximum. >> >> https://spark.apache.org/docs/latest/running-on-yarn.html &

Off Heap (Tungsten) Memory Usage / Management ?

2016-09-21 Thread Michael Segel
I’ve asked this question a couple of times from a friend who didn’t know the answer… so I thought I would try here. Suppose we launch a job on a cluster (YARN) and we have set up the containers to be 3GB in size. What does that 3GB represent? I mean what happens if we end up using 2-3GB of
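The answer that emerged in this thread comes down to arithmetic. A rough sketch under assumed defaults from the Spark-on-YARN documentation of that era (overhead = max(384 MB, 10% of executor memory)) -- not authoritative, and the numbers are invented:

```python
# The YARN container must cover the JVM heap plus everything off-heap,
# so Spark requests roughly:
#   container = spark.executor.memory + memoryOverhead
def container_request_mb(executor_memory_mb, overhead_fraction=0.10,
                         overhead_min_mb=384):
    overhead = max(overhead_min_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead

print(container_request_mb(2662))  # 3046 -- a ~2.6 GB heap fits a 3 GB container
print(container_request_mb(1024))  # 1408 -- small heaps hit the 384 MB floor
```

YARN monitors the whole process, so if heap plus Tungsten off-heap allocations grow past the container limit, the NodeManager kills the executor.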

Re: Spark Thrift Server performance

2016-07-13 Thread Michael Segel
Hey, silly question? If you’re running a load balancer, are you trying to reuse the RDDs between jobs? TIA -Mike > On Jul 13, 2016, at 9:08 AM, ayan guha > wrote: > > My 2 cents: > > Yes, we are running multiple STS (we are running on different nodes, but you >

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-13 Thread Michael Segel
I believe that there is one JVM for the Thrift Service and that there is only one context for the service. This would allow you to share RDDs across multiple jobs, however… not so great for security. HTH… > On Jul 10, 2016, at 10:05 PM, Takeshi Yamamuro > wrot

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. Tez is ‘vendor’ independent. ;-) Yeah… I know… Anyone can support it. Only Hortonworks has stacked the deck in their favor. Drill could be in the same boat, although there now more committers who are not working for MapR. I’m not sure who outside of HW is supporting

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
I don’t think that it would be a good comparison. If memory serves, Tez w LLAP is going to be running a separate engine that is constantly running, no? Spark? That runs under hive… Unless you’re suggesting that the spark context is constantly running as part of the hiveserver2? > On May

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
asic data from the DB2 tables > but afterwards I'm pretty free to transform the data as needed. > > > > On 6. Juli 2016 um 22:12:26 MESZ, Michael Segel > wrote: >> I think you need to learn the basics of how to build a ‘data >> lake/pond/sewer’ first. &g

Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Michael Segel
I think you need to learn the basics of how to build a ‘data lake/pond/sewer’ first. The short answer is yes. The longer answer is that you need to think more about translating a relational model in to a hierarchical model, something that I seriously doubt has been taught in schools in a very

Re: Spark application doesn't scale to worker nodes

2016-07-05 Thread Michael Segel
Did the OP say he was running a stand alone cluster of Spark, or on Yarn? > On Jul 5, 2016, at 10:22 AM, Mich Talebzadeh > wrote: > > Hi Jakub, > > Any reason why you are running in standalone mode, given that your are > familiar with YARN? > > In theory your settings are correct. I checke

Re: Joining a compressed ORC table with a non compressed text table

2016-06-29 Thread Michael Segel
Hi, I’m not sure I understand your initial question… Depending on the compression algo, you may or may not be able to split the file. So if it’s not splittable, you have a single long-running thread. My guess is that you end up with a very long single partition. If so, if you repartition, y

Re: Spark Thrift Server Concurrency

2016-06-23 Thread Michael Segel
Hi, There are a lot of moving parts and a lot of unknowns from your description. Besides the version stuff. How many executors, how many cores? How much memory? Are you persisting (memory and disk) or just caching (memory) During the execution… same tables… are you seeing a lot of shufflin

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
The only documentation on this… in terms of direction … (that I could find) If your client is not close to the cluster (e.g. your PC) then you definitely want to go cluster to improve performance. If your client is close to the cluster (e.g. an edge node) then you could go either client or clust

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 22 June 2016 at 19:04, Michael Segel <mailto:msegel_had...@hotmail.

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
rbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 22 June 2016 at 15:59, Michael Segel <mailto:msegel_ha

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
s different execution engines (TEZ, Spark), formats (Orc, > parquet) and further optimizations to make the analysis fast. It always > depends on your use case. > > On 22 Jun 2016, at 05:47, Michael Segel <mailto:msegel_had...@hotmail.com>> wrote: > >> >> Sorry,

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
> hive metastore and streams data directly from HDFS. Hive MR jobs do not play > any role here, making spark faster than hive. > > HTH > > Ayan > > On Wed, Jun 22, 2016 at 9:58 AM, Michael Segel <mailto:msegel_had...@hotmail.com>> wrote: > Ok, its at th

Re: Union of multiple RDDs

2016-06-21 Thread Michael Segel
By repartition I think you mean coalesce() where you would get one parquet file per partition? And this would be a new immutable copy so that you would want to write this new RDD to a different HDFS directory? -Mike > On Jun 21, 2016, at 8:06 AM, Eugene Morozov > wrote: > > Apurva, > >

Silly question about Yarn client vs Yarn cluster modes...

2016-06-21 Thread Michael Segel
Ok, its at the end of the day and I’m trying to make sure I understand the locale of where things are running. I have an application where I have to query a bunch of sources, creating some RDDs and then I need to join off the RDDs and some other lookup tables. Yarn has two modes… client and

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
> On Jun 8, 2016, at 3:35 PM, Eugene Koifman wrote: > > if you split “create table test.dummy as select * from oraclehadoop.dummy;” > into create table statement, followed by insert into test.dummy as select… > you should see the behavior you expect with Hive. > Drop statement will block while

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
you try a select * from foo; and in another shell try dropping foo? and if you want to simulate a m/r job add something like an order by 1 clause. HTH -Mike > On Jun 8, 2016, at 2:36 PM, Michael Segel wrote: > > Hi, > > Lets take a step back… > > Which version of Hive

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
ile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 30 May 2016 at 20:19, Michael Segel <mailto:msegel_had...@hotmail.com>> wrote: > Mich, > > Most people use vendor releases because they

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, Most people use vendor releases because they need to have the support. Hortonworks is the vendor who has the most skin in the game when it comes to Tez. If memory serves, Tez isn’t going to be M/R but a local execution engine? Then LLAP is the in-memory piece to speed up Tez? HTH -M

Re: HiveContext standalone => without a Hive metastore

2016-05-30 Thread Michael Segel
Going from memory… Derby is/was Cloudscape which IBM acquired from Informix who bought the company way back when. (Since IBM released it under Apache licensing, Sun Microsystems took it and created JavaDB…) I believe that there is a networking function so that you can either bring it up in st

Re: Secondary Indexing?

2016-05-30 Thread Michael Segel
w?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 30 May 2016 at 17:08, Michael Segel <mailto:msegel_had...@hotmail.com>> wrote: > I’m not sure where to post this since its a bit of a philosophical qu

Secondary Indexing?

2016-05-30 Thread Michael Segel
I’m not sure where to post this since it’s a bit of a philosophical question in terms of design and vision for spark. If we look at SparkSQL and performance… where does Secondary indexing fit in? The reason this is a bit awkward is that if you view Spark as querying RDDs which are temporary, i

Re: SPARK - DataFrame for BulkLoad

2016-05-18 Thread Michael Segel
Yes, but he’s using Phoenix which may not work cleanly with your HBase spark module. The key issue here may be Phoenix which is separate from HBase. > On May 18, 2016, at 5:36 AM, Ted Yu wrote: > > Please see HBASE-14150 > > The hbase-spark module would be available in the upcoming hbase 2

Re: Silly Question on my part...

2016-05-17 Thread Michael Segel
uted Query Engine connector. > > 2016-05-17 5:12 GMT+10:00 Michael Segel <mailto:msegel_had...@hotmail.com>>: > For one use case.. we were considering using the thrift server as a way to > allow multiple clients access shared RDDs. > > Within the Thrift Context, we create an RD

Silly Question on my part...

2016-05-16 Thread Michael Segel
For one use case.. we were considering using the thrift server as a way to allow multiple clients access shared RDDs. Within the Thrift Context, we create an RDD and expose it as a hive table. The question is… where does the RDD exist. On the Thrift service node itself, or is that just a ref

Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, Another silly question… Don’t you want to use the header line to help create a schema for the RDD? Thx -Mike > On May 3, 2016, at 8:09 AM, Mathieu Longtin wrote: > > This only works if the files are "unsplittable". For example gzip files, each > partition is one file (if you have more
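The counter-question above (use the header as the schema instead of discarding it) has a direct stdlib analogue. A sketch with invented column names, mirroring what Spark's CSV reader does with `header=true`:

```python
import csv, io

# The first line becomes the field names; the remaining lines are records.
data = io.StringIO("id,beat,primary_type\n1,0111,THEFT\n2,0213,BATTERY\n")

reader = csv.DictReader(data)      # header row -> schema
rows = list(reader)                # data rows -> records keyed by the header

print(reader.fieldnames)           # ['id', 'beat', 'primary_type']
print(rows[0]["primary_type"])     # THEFT
```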

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Michael Segel
Silly question? If you change the predicate to ( s.date >= ‘2016-01-03’ OR s.date IS NULL ) AND (d.date >= ‘2016-01-03’ OR d.date IS NULL) What do you get? Sorry if the syntax isn’t 100% correct. The idea is to not drop null values from the query. I would imagine that this shouldn’t kill
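The suggested predicate can be checked with the stdlib `sqlite3` module. The tables and values below are invented stand-ins for the thread's `s` and `d`; the point carries over to Spark SQL because the semantics (a WHERE clause on the null side of an outer join silently drops unmatched rows) are standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s (id INTEGER, date TEXT);
    CREATE TABLE d (id INTEGER, date TEXT);
    INSERT INTO s VALUES (1, '2016-01-04'), (2, '2016-01-05');
    INSERT INTO d VALUES (1, '2016-01-04');   -- id 2 has no match in d
""")

strict = conn.execute("""
    SELECT s.id FROM s LEFT OUTER JOIN d ON s.id = d.id
    WHERE d.date >= '2016-01-03'""").fetchall()

null_safe = conn.execute("""
    SELECT s.id FROM s LEFT OUTER JOIN d ON s.id = d.id
    WHERE d.date >= '2016-01-03' OR d.date IS NULL""").fetchall()

print(strict)     # [(1,)]        -- the unmatched row was silently dropped
print(null_safe)  # [(1,), (2,)]  -- the outer-join semantics survive
```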

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Michael Segel
data). This solution is suitable for very complex > (targeted) analyzing. It can be too slow and memory-consuming, but well done > pre-processing of log data can help a lot. > > --- > Esa Heikkinen > > 28.4.2016, 14:44, Michael Segel kirjoitti: >> I don’t. >> >&g

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
ter in financial trading but rarely. > HTH > > > > Dr Mich Talebzadeh > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> &g

Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Michael Segel
nen > wrote: > > > Do you know any good examples how to use Spark streaming in tracking public > transportation systems ? > > Or Storm or some other tool example ? > > Regards > Esa Heikkinen > > 28.4.2016, 3:16, Michael Segel kirjoitti: >> Uhm… >

Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel
Doh! Wrong email account again! > Begin forwarded message: > > From: Michael Segel > Subject: Re: Spark support for Complex Event Processing (CEP) > Date: April 27, 2016 at 7:16:55 PM CDT > To: Mich Talebzadeh > Cc: Esa Heikkinen , "user@spark" > > &g

Fwd: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Michael Segel
Sorry sent from wrong email address. > Begin forwarded message: > > From: Michael Segel > Subject: Re: Spark support for Complex Event Processing (CEP) > Date: April 27, 2016 at 7:51:14 AM CDT > To: Mich Talebzadeh > Cc: Esa Heikkinen , "user @spark" &g

Re: Spark SQL Transaction

2016-04-21 Thread Michael Segel
Hi, Sometimes terms get muddled over time. If you’re not using transactions, then each database statement is atomic and is itself a transaction. So unless you have some explicit ‘Begin Work’ at the start…. your statements should be atomic and there will be no ‘redo’ or ‘commit’ or ‘rollback’.
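The statement-level atomicity described above can be demonstrated with the stdlib `sqlite3` module in autocommit mode. A sketch, not the OP's target database: SQLite stands in for whatever JDBC sink the thread was about.

```python
import sqlite3

# Without an explicit BEGIN, each statement is its own atomic transaction
# and there is nothing left to roll back afterwards.
conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
conn.execute("CREATE TABLE t (x INTEGER)")

conn.execute("INSERT INTO t VALUES (1)")   # atomic, committed on its own
count_after_single = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

conn.execute("BEGIN")                      # an explicit 'Begin Work'
conn.execute("INSERT INTO t VALUES (2)")
conn.execute("ROLLBACK")                   # undoes only the open transaction

print(count_after_single)                                    # 1
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # still 1
```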

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-21 Thread Michael Segel
How many partitions in your data set. Per the Spark DataFrameWritetr Java Doc: “ Saves the content of the DataFrame to a external database table via JDBC. In the case the table already exists in the external da

Re: inter spark application communication

2016-04-18 Thread Michael Segel
ant to chain spark applications. > On Mon, Apr 18, 2016 at 4:46 PM Michael Segel <mailto:msegel_had...@hotmail.com>> wrote: > Yes, but I’m confused. Are you chaining your spark jobs? So you run job one > and its output is the input to job 2? > >> On A

Re: How to start HDFS on Spark Standalone

2016-04-18 Thread Michael Segel
Perhaps this is a silly question on my part…. Why do you want to start up HDFS on a single node? You only mention one windows machine in your description of your cluster. If this is a learning experience, why not run Hadoop in a VM (MapR and I think the other vendors make linux images that ca

Re: inter spark application communication

2016-04-18 Thread Michael Segel
have you thought about Akka? What are you trying to send? Why do you want them to talk to one another? > On Apr 18, 2016, at 12:04 PM, Soumitra Johri > wrote: > > Hi, > > I have two applications : App1 and App2. > On a single cluster I have to spawn 5 instances os App1 and 1 instance of >

Re: Silly question...

2016-04-13 Thread Michael Segel
8Pw > > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> > > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> > > > On 12 April 2016 at 23:42, Michael Segel <mailto:msegel_had...@hotmail.com>>

Fwd: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Segel
Sorry for duplicate(s), I forgot to switch my email address. > Begin forwarded message: > > From: Michael Segel > Subject: Re: Can i have a hive context and sql context in the same app ? > Date: April 12, 2016 at 4:05:26 PM MST > To: Michael Armbrust > Cc: N

Silly question...

2016-04-12 Thread Michael Segel
Hi, This is probably a silly question on my part… I’m looking at the latest (spark 1.6.1 release) and would like to do a build w Hive and JDBC support. From the documentation, I see two things that make me scratch my head. 1) Scala 2.11 "Spark does not yet support its JDBC component for Sca

Re: Sqoop on Spark

2016-04-11 Thread Michael Segel
> Dr Mich Talebzadeh >> >> LinkedIn >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw> >> >> http://talebzadehmich.wordpress.com <http://talebzadehmic

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
om>> wrote: > Well I am not sure, but using a database as a storage, such as relational > databases or certain nosql databases (eg MongoDB) for Spark is generally a > bad idea - no data locality, it cannot handle real big data volumes for > compute and you may potentially overload

Re: Sqoop on Spark

2016-04-06 Thread Michael Segel
I don’t think its necessarily a bad idea. Sqoop is an ugly tool and it requires you to make some assumptions as a way to gain parallelism. (Not that most of the assumptions are not valid for most of the use cases…) Depending on what you want to do… your data may not be persisted on HDFS. The

Re: Relation between number of partitions and cores.

2016-04-01 Thread Michael Segel
There’s a mix of terms here. CPU is the physical chip which most likely contains more than 1 physical core. If you’re on Intel, there are physical cores and virtual cores. So 1 physical core is seen by the OS as two virtual cores. Then there are ‘cores per executor’ (spark terminology). So
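The terms the post untangles reduce to simple arithmetic. All numbers below are invented for illustration; 'cores per executor' is a Spark task-slot count, not a count of physical cores.

```python
# Hardware-side terms:
physical_cores = 8
virtual_cores = physical_cores * 2          # e.g. Intel hyper-threading

# Spark-side terms:
executors = 4
cores_per_executor = 3                      # task slots per executor JVM
concurrent_tasks = executors * cores_per_executor

# Partitions beyond the concurrent-task count queue up in "waves":
partitions = 100
waves = -(-partitions // concurrent_tasks)  # ceiling division

print(concurrent_tasks)  # 12 tasks run at once across the cluster
print(waves)             # 100 partitions finish in 9 waves of tasks
```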

Re: Unable to Limit UI to localhost interface

2016-03-30 Thread Michael Segel
It sounds like when you start up spark, its using 0.0.0.0 which means it will listen on all interfaces. You should be able to limit which interface to use. The weird thing is that if you are specifying the IP Address and Port, Spark shouldn’t be listening on all of the interfaces for that po

Fwd: Spark and N-tier architecture

2016-03-29 Thread Michael Segel
> Begin forwarded message: > > From: Michael Segel > Subject: Re: Spark and N-tier architecture > Date: March 29, 2016 at 4:16:44 PM MST > To: Alexander Pivovarov > Cc: Mich Talebzadeh , Ashok Kumar > , User > > So… > > Is spark-jobserver an offi

Re: Spark SQL Json Parse

2016-03-03 Thread Michael Segel
Why do you want to write out NULL if the column has no data? Just insert the fields that you have. > On Mar 3, 2016, at 9:10 AM, barisak wrote: > > Hi, > > I have a problem with Json Parser. I am using spark streaming with > hiveContext for keeping json format tweets. The flume collects twee

Re: temporary tables created by registerTempTable()

2016-02-15 Thread Michael Segel
I was just looking at that… Out of curiosity… if you make it a Hive Temp Table… who has access to the data? Just your app, or anyone with access to the same database? (Would you be able to share data across different JVMs? ) (E.G - I have a reader who reads from source A that needs to publi

Re: Spark Job Server with Yarn and Kerberos

2016-01-04 Thread Michael Segel
Its been a while... but this isn’t a spark issue. A spark job on YARN runs as a regular job. What happens when you run a regular M/R job by that user? I don’t think we did anything special... > On Jan 4, 2016, at 12:22 PM, Mike Wright > wrote: > > Has anyone used S

Re: stopping a process usgin an RDD

2016-01-04 Thread Michael Segel
Not really a good idea. It breaks the paradigm. If I understand the OP’s idea… they want to halt processing the RDD, but not the entire job. So when it hits a certain condition, it will stop that task yet continue on to the next RDD. (Assuming you have more RDDs or partitions than you have t

Re: TCP/IP speedup

2015-08-02 Thread Michael Segel
This may seem like a silly question… but in following Mark’s link, the presentation talks about the TPC-DS benchmark. Here’s my question… what benchmark results? If you go over to the TPC.org website they have no TPC-DS benchmarks listed. (Either audited or unaudited) So

Re: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
.D. > Author: Programming Scala, 2nd Edition > <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly) > Typesafe <http://typesafe.com/> > @deanwampler <http://twitter.com/deanwampler> > http://polyglotprogramming.com <http://polyglotprogramming.com

Fwd: Silly question about building Spark 1.4.1

2015-07-20 Thread Michael Segel
Sorry, Should have sent this to user… However… it looks like the docs page may need some editing? Thx -Mike > Begin forwarded message: > > From: Michael Segel > Subject: Silly question about building Spark 1.4.1 > Date: July 20, 2015 at 12:26:40 PM MST > To: d..

Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… Lets set aside spark, and look at the overall ingestion pattern. Its really an ingestion pattern where your input in to the system is from a queue. Are the events discrete or continuous? (This is kinda important.) If the events are continuous then more than

Re: Research ideas using spark

2015-07-16 Thread Michael Segel
015, at 12:40 PM, vaquar khan wrote: > > I would suggest study spark ,flink,strom and based on your understanding and > finding prepare your research paper. > > May be you will invented new spark ☺ > > Regards, > Vaquar khan > > On 16 Jul 2015 00:47, &q

Re: Research ideas using spark

2015-07-15 Thread Michael Segel
Silly question… When thinking about a PhD thesis… do you want to tie it to a specific technology, or do you want to investigate an idea and then use a specific technology? Or is this an outdated way of thinking? "I am doing my PHD thesis on large scale machine learning e.g Online learning,

Re: Spark performance

2015-07-12 Thread Michael Segel
Not necessarily. It depends on the use case and what you intend to do with the data. 4-6 TB will easily fit on an SMP box and can be efficiently searched by an RDBMS. Again it depends on what you want to do and how you want to do it. Informix’s IDS engine with its extensibility could still o

Re: Spark or Storm

2015-06-17 Thread Michael Segel
Actually the reverse. Spark Streaming is really a micro-batch system where the smallest window is 1/2 a second (500ms). So for CEP, it’s not really a good idea. So in terms of options…. spark streaming, storm, samza, akka and others… Storm is probably the easiest to pick up, spark streaming

Re: HW imbalance

2015-01-30 Thread Michael Segel
eneously sized executors won't be able to take advantage > of the extra memory on the bigger boxes. > > Cloudera Manager can certainly configure YARN with different resource > profiles for different nodes if that's what you're wondering. > > -Sandy > > On T

Re: HW imbalance

2015-01-29 Thread Michael Segel
me amount of memory. It's possibly to configure YARN with different > amounts of memory for each host (using yarn.nodemanager.resource.memory-mb), > so other apps might be able to take advantage of the extra memory. > > -Sandy > > On Mon, Jan 26, 2015 at 8:34 AM, Michael S
