Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Jörn Franke
You could also explore the in-memory option of 12c. However, I am not sure how beneficial it is for OLTP scenarios. I am excited to see how the performance will be with HBase as a Hive metastore. Nevertheless, your results on Oracle/SSD will be beneficial for the community. > On 17 Apr 2016,

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
You have probably read this benchmark from Yahoo; any comments from the Spark side? https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at > On 17 Apr 2016, at 12:41, andy

Spark support for Complex Event Processing (CEP)

2016-04-17 Thread Mich Talebzadeh
Hi, Has Spark got libraries for CEP using Spark Streaming with Kafka, by any chance? I am looking at Flink, which is supposed to have these libraries for CEP, but I find Flink itself very much a work in progress. Thanks, Dr Mich Talebzadeh

Moving Hive metastore to Solid State Disks

2016-04-17 Thread Mich Talebzadeh
Hi, I have had my Hive metastore database on Oracle 11g supporting concurrency (with added transactional capability). Over the past few days I created a new schema on Oracle 12c on Solid State Disks (SSD) and used Data Pump (expdp, impdp) to migrate the Hive database from Oracle 11g to Oracle 12c on

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
It seems that Flink argues that the latency for streaming data is eliminated, whereas with Spark RDDs there is this latency. I noticed that Flink does not support an interactive shell much like the Spark shell, where you can add jars to it to do Kafka testing. The advice was to add the streaming Kafka jar

Re: Apache Flink

2016-04-17 Thread Igor Berman
Latency in Flink is not eliminated, but it might be smaller since Flink processes each event one-by-one while Spark does micro-batching (so you can't achieve latency less than your micro-batch interval). Spark will probably have better throughput due to this micro-batching. On 17 April 2016 at 14:47,
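
For illustration, a minimal sketch of the micro-batch latency point: the Spark Streaming batch interval (1 second here, an arbitrary example value) is a hard floor on how quickly any single record can be processed.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("MicroBatchLatency").setMaster("local[2]")
    // Records are only processed when the current batch closes, so
    // end-to-end latency can never be below the batch interval.
    val ssc = new StreamingContext(conf, Seconds(1))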

Re: Apache Flink

2016-04-17 Thread andy petrella
Just adding one thing to the mix: `that the latency for streaming data is eliminated` is insane :-D On Sun, Apr 17, 2016 at 12:19 PM Mich Talebzadeh wrote: > It seems that Flink argues that the latency for streaming data is > eliminated whereas with Spark RDD there

RE: Apache Flink

2016-04-17 Thread Silvio Fiorito
Actually there were multiple responses to it on the GitHub project, including a PR to improve the Spark code, but they weren’t acknowledged. From: Ovidiu-Cristian MARCU Sent: Sunday, April 17, 2016 7:48 AM To: andy petrella

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
For the streaming case Flink is fault tolerant (DataStream API); for the batch case (DataSet API), not yet, based on my research into their platform. > On 17 Apr 2016, at 17:03, Koert Kuipers wrote: > > i never found much info that flink was actually designed to be fault

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
The streaming use case is important IMO, as Spark (like Flink) advocates for the unification of analytics tools, having everything in one place: batch and graph processing, SQL, ML, and streaming. > On 17 Apr 2016, at 17:07, Corey Nolet wrote: > > One thing I've noticed about Flink in

Re: Moving Hive metastore to Solid State Disks

2016-04-17 Thread Mich Talebzadeh
Hi Jörn, Sure, will do. What the Oracle in-memory offering does is allow the user to store a *copy* of selected tables, or partitions, in *columnar* format in memory within the Oracle Database memory space. All tables are still present in row format and all copies on storage are in row format. These

Re: JSON Usage

2016-04-17 Thread Hyukjin Kwon
Hi! Personally, I don't think it necessarily needs to be a DataSet for your goal. Just select your data at "s3" from the DataFrame loaded by sqlContext.read.json(). You can try printSchema() to check the nested schema and then select the data. Also, I guess (from your code) you are trying to
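
A minimal sketch of the approach described above; the input path and the nested field names under "s3" are hypothetical:

    val df = sqlContext.read.json("events.json")  // hypothetical input
    df.printSchema()                              // inspect the nested structure first
    // Nested fields can be selected with dot notation; no DataSet needed
    val selected = df.select("s3.bucket.name", "s3.object.key")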

Re: Apache Flink

2016-04-17 Thread Koert Kuipers
I never found much info that Flink was actually designed to be fault tolerant. If fault tolerance is more of a bolt-on/add-on/afterthought, then that doesn't bode well for large-scale data processing. Spark was designed with fault tolerance in mind from the beginning. On Sun, Apr 17, 2016 at 9:52 AM,

Re: Spark support for Complex Event Processing (CEP)

2016-04-17 Thread Luciano Resende
Hi Mich, I know some folks who were investigating/prototyping in this area; let me see if I can get them to reply here with more details. On Sun, Apr 17, 2016 at 1:54 AM, Mich Talebzadeh wrote: > Hi, > > Has Spark got libraries for CEP using Spark Streaming with

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Hi, I read the benchmark published by Yahoo. Obviously they already use Storm and are inevitably very familiar with that tool. To start with, although these benchmarks were somewhat interesting IMO, the exercise lends itself to an assurance that the tool chosen for their platform is still the best choice. So

Re: Apache Flink

2016-04-17 Thread Corey Nolet
One thing I've noticed about Flink in following the project is that it has established, in a few cases, some novel ideas and improvements over Spark. The problem with it, however, is that both the development team and the community around it are very small and many of those novel

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
Hi Mich, IMO one will try to see if there is an alternative, a better one at least. This benchmark could be a good starting point. Best, Ovidiu > On 17 Apr 2016, at 15:52, Mich Talebzadeh wrote: > > Hi, > > I read the benchmark published by Yahoo. Obviously they

Docker Mesos Spark Port Mapping

2016-04-17 Thread John Omernik
The setting spark.mesos.executor.docker.portmaps is interesting to me. Without this setting, the Docker executor uses net=host and thus port mappings are not needed. With this setting (and just adding some random mappings), my executors fail with less than helpful messages. I guess some
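
A sketch of how the property is typically set, assuming the documented host_port:container_port[:tcp|:udp] format; the image name and port numbers here are arbitrary:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Hypothetical image; port mappings only apply with bridged (non-host) networking
      .set("spark.mesos.executor.docker.image", "myrepo/spark-mesos:latest")
      // Comma-separated host_port:container_port[:tcp|:udp] pairs
      .set("spark.mesos.executor.docker.portmaps", "7100:7100,7101:7101:tcp")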

Access to Mesos Docker Cmd for Spark Executors

2016-04-17 Thread John Omernik
Hey all, I was wondering if there is a way to access/edit the command on Spark executors while using Docker on Mesos. The reason is this: I am using the MapR File Client, and the Spark driver is trying to execute things as my user "user1", and since the executors are running as root inside and

Re: JSON Usage

2016-04-17 Thread Benjamin Kim
Hyukjin, This is what I have done so far. I didn't use DataSet yet; maybe I don't need to. var df: DataFrame = null for(message <- messages) { val bodyRdd = sc.parallelize(message.getBody() :: Nil) val fileDf = sqlContext.read.json(bodyRdd) .select(
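
A minimal sketch of where that loop appears to be heading; the selected column and the union-accumulate step are assumptions, since the original is truncated:

    import org.apache.spark.sql.DataFrame

    var df: DataFrame = null
    for (message <- messages) {
      // Each message body holds one JSON document
      val bodyRdd = sc.parallelize(message.getBody() :: Nil)
      val fileDf = sqlContext.read.json(bodyRdd)
        .select("s3")  // hypothetical column, per the earlier reply
      // Accumulate the per-message frames into a single DataFrame
      df = if (df == null) fileDf else df.unionAll(fileDf)
    }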

Re: Apache Flink

2016-04-17 Thread Ovidiu-Cristian MARCU
Yes, mostly regarding Spark partitioning and the use of groupByKey instead of reduceByKey. However, Flink extended the benchmark here: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ So I was curious about an
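
For context, a minimal sketch of the groupByKey-vs-reduceByKey point, on a hypothetical pair RDD: reduceByKey combines values map-side before the shuffle, while groupByKey ships every individual value across the network.

    val pairs = sc.parallelize(Seq("a" -> 1L, "b" -> 1L, "a" -> 1L))
    // Preferred: partial aggregation happens before the shuffle
    val byReduce = pairs.reduceByKey(_ + _)
    // Same result, but every (key, value) pair is shuffled first
    val byGroup = pairs.groupByKey().mapValues(_.sum)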

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Thanks Corey for the useful info. I have used Sybase Aleri and StreamBase as commercial CEP engines. However, there does not seem to be anything close to these products in the Hadoop ecosystem. So I guess there is nothing there? Regards, Dr Mich Talebzadeh

Re: Apache Flink

2016-04-17 Thread Mark Hamstra
To be fair, the Stratosphere project from which Flink springs was started as a collaborative university research project in Germany about the same time that Spark was first released as Open Source, so they are near contemporaries rather than Flink having been started only well after Spark was an

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
The stats are only for one file in one partition. There are 17,970,737 rows in total. The table is not bucketed. The problem is not inserting rows; the problem is with this SQL query: “SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as `dealviews` FROM myTable
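
The quoted query is cut off at the table name; given the selected columns, it presumably continues with a grouping on event_date, along these lines (the WHERE range is purely hypothetical, and a later reply does mention where conditions):

    sqlContext.sql("""
      SELECT `event_date`, sum(`bookings`) AS `bookings`, sum(`dealviews`) AS `dealviews`
      FROM myTable
      WHERE `event_date` >= '2016-01-01'  -- hypothetical predicate
      GROUP BY `event_date`
    """)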

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
Hang on, so it takes 15 seconds to switch the database context with HiveContext.sql("use myDatabase")? Dr Mich Talebzadeh

Re: Spark support for Complex Event Processing (CEP)

2016-04-17 Thread Mich Talebzadeh
Thanks Luciano. Appreciated. Regards, Dr Mich Talebzadeh On 17 April 2016

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
Hi, I am using the Cloudera distribution, and when I do a "desc formatted table" I don't get all the table parameters. But I did a hive --orcfiledump on one random file (I replaced some of the values that may be sensitive): hive --orcfiledump

Re: Apache Flink

2016-04-17 Thread Michael Malak
There have been commercial CEP solutions for decades, including from my employer. From: Mich Talebzadeh To: Mark Hamstra Cc: Corey Nolet ; "user @spark" Sent: Sunday, April 17, 2016 3:48 PM

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Assuming that Spark and Flink are contemporaries, what are the reasons that Flink has not been adopted as widely? (This may sound obvious and/or prejudged.) I mean, Spark has surged in popularity in the past year, if I am correct. Dr Mich Talebzadeh

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Also, it always amazes me why there are so many tangential projects in the Big Data space. Would it not be easier if efforts were spent on adding to Spark's functionality rather than creating a new product like Flink? Dr Mich Talebzadeh

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
Hi Maurin, Have you tried creating your table in Hive as a Parquet table? This table is pretty small, with 100K rows. Is the Hive table bucketed at all? I gather your issue at the moment is that inserting rows into the Hive table is taking longer (compared to Parquet)? HTH, Dr Mich Talebzadeh

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
The problem is that the strength and wider acceptance of a typical open-source project lie in its sizeable user and development community. When the community is small, like Flink's, it is not a viable solution to adopt. I am rather disappointed that no big data project can be used for Complex Event

Re: Apache Flink

2016-04-17 Thread Michael Malak
In terms of publication date, a paper on Nephele was published in 2009, prior to the 2010 USENIX paper on Spark. Nephele is the execution engine of Stratosphere, which became Flink. From: Mark Hamstra To: Mich Talebzadeh Cc: Corey

Re: Apache Flink

2016-04-17 Thread Mich Talebzadeh
Hi Corey, Can you please point me to docs on using Spark for CEP? Do we have a set of CEP libraries somewhere? I am keen on getting hold of adaptor libraries for Spark, something like below. Thanks, Dr Mich Talebzadeh

Re: Apache Flink

2016-04-17 Thread Corey Nolet
I have not been intrigued at all by the micro-batching concept in Spark. I am used to CEP in real stream-processing environments like InfoSphere Streams & Storm, where the granularity of processing is at the level of each individual tuple and processing units (workers) can react immediately to

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Maurin Lenglart
Let me explain my architecture a little: I have one cluster with Hive and Spark. On it I create my databases, create the tables, and insert data into them. If I execute this query: self.sqlContext.sql(“SELECT `event_date` as `event_date`, sum(`bookings`) as `bookings`, sum(`dealviews`) as

Re: Apache Flink

2016-04-17 Thread Michael Malak
As with all history, "what if"s are not scientifically testable hypotheses, but my speculation is the energy (VCs, startups, big Internet companies, universities) within Silicon Valley contrasted to Germany. From: Mich Talebzadeh To: Michael Malak

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Rajesh Balamohan
1. In the first case (i.e., the cluster where you have Hive and Spark), it would have executed via HiveTableScan instead of OrcRelation. HiveTableScan would not propagate any PPD (predicate pushdown) information to the ORC readers (SPARK-12998). PPD might not play a big role here, as your where conditions seem to be only
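
For reference, a minimal sketch of reading ORC through the data-source path (OrcRelation) with predicate pushdown enabled; the table path is hypothetical, and ORC support needs a HiveContext in Spark 1.x:

    // Predicate pushdown to ORC is off by default in Spark 1.x
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")
    val orcDf = sqlContext.read.format("orc")
      .load("/user/hive/warehouse/mydb.db/mytable")  // hypothetical path
    orcDf.filter("event_date >= '2016-01-01'").count()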

Re: orc vs parquet aggregation, orc is really slow

2016-04-17 Thread Mich Talebzadeh
OK, that may explain it. In the other cluster you register it as a temp table and then collect data using SQL running against that temp table, which loads the data at that point; if you do not have enough memory for your temp table, it will have to spill to disk and make many passes. Could that be a
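
A minimal sketch of the second-cluster pattern being described (paths and names hypothetical):

    // Read the files directly and expose them to SQL as a temp table
    val df = sqlContext.read.format("orc").load("/data/myTable")  // hypothetical path
    df.registerTempTable("myTable")
    // Caching keeps the table in executor memory; without enough memory it
    // spills to disk, which could explain the slow aggregation
    sqlContext.cacheTable("myTable")
    sqlContext.sql("SELECT count(*) FROM myTable").show()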

strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Hello All, We are using HashPartitioner in the following way on a 3 node cluster (1 master and 2 worker nodes). val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt").map[(Int, Int)](line => { line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) } }).partitionBy(new
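
The quoted snippet is cut off at the partitioner; a runnable sketch of the same pattern (the partition count of 8 is taken from the follow-up message below):

    import org.apache.spark.HashPartitioner

    val u = sc.textFile("hdfs://x.x.x.x:8020/user/azureuser/s.txt")
      .map[(Int, Int)] { line =>
        line.split("\\|") match { case Array(x, y) => (y.toInt, x.toInt) }
      }
      .partitionBy(new HashPartitioner(8))  // 8 partitions, per the follow-up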

Re: Apache Flink

2016-04-17 Thread Otis Gospodnetić
While Flink may not be younger than Spark, Spark came to Apache first, which always helps. Plus, there was already a lot of buzz around Spark before it came to Apache. Coming from Berkeley also helps. That said, Flink seems decently healthy to me: -

Re: WELCOME to user@spark.apache.org

2016-04-17 Thread jinan_alhajjaj
Hello, I would like to know how to parse XML files using Apache Spark in Java. I am doing this for my senior project; I am a beginner in Apache Spark and have just a little experience with it. Thank you. On Apr 18, 2016, at 3:14 AM, user-h...@spark.apache.org wrote: > Hi!

Re: Apache Flink

2016-04-17 Thread Peyman Mohajerian
Micro-batching is certainly not a waste of time; you are making way too strong a statement. In fact, in certain cases one tuple at a time makes no sense; it all depends on the use case. If you understand the history of the Storm project, you would know that micro-batching was added

Re: Apache Flink

2016-04-17 Thread Corey Nolet
Peyman, I'm sorry, I missed the comment that micro-batching was a waste of time. Did someone mention this? I know this thread got pretty long so I may have missed it somewhere above. My comment about Spark's micro-batching being a downside is strictly in reference to CEP. Complex CEP flows are

Re: Apache Flink

2016-04-17 Thread Todd Nist
So there is an offering from Stratio: https://github.com/Stratio/Decision > Decision CEP engine is a Complex Event Processing platform built on Spark Streaming. > It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex

Re: WELCOME to user@spark.apache.org

2016-04-17 Thread Hyukjin Kwon
Hi Jinan, There are some example test codes for XML here: https://github.com/databricks/spark-xml/blob/master/src/test/java/com/databricks/spark/xml/JavaXmlSuite.java. Or you can see the documentation in README.md: https://github.com/databricks/spark-xml#java-api. There are other basic Java
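
For a quick flavour of spark-xml (shown in Scala; the Java API linked above is analogous), a minimal sketch assuming a hypothetical file of <book> records:

    // Requires the com.databricks:spark-xml artifact on the classpath
    val books = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")  // each <book> element becomes one row
      .load("books.xml")         // hypothetical input file
    books.printSchema()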

Fwd: [Help]:Strange Issue :Debug Spark Dataframe code

2016-04-17 Thread Divya Gehlot
Reposting, as I am unable to find the root cause of where things are going wrong. Experts, please help. -- Forwarded message -- From: Divya Gehlot Date: 15 April 2016 at 19:13 Subject: [Help]: Strange Issue: Debug Spark DataFrame code To: "user @spark"

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
If the data file is the same, then it should have a similar distribution of keys. A few queries: 1. Did you compare the number of partitions in both cases? 2. Did you compare the resource allocation for the Spark shell vs. the Scala program being submitted? Also, can you please share the details of the Spark

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
We are testing with 52MB, but it would go to 20GB and more later on. The cluster size is also not static; we would be growing it. But the issue here is the behavior of HashPartitioner: from what I understand, it should be partitioning the data based on the hash of the key, irrespective of the RAM

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
A HashPartitioner will indeed partition based on the key, but you cannot know on *which* node that key will appear. Again, the RDD partitions will not necessarily be distributed evenly across your nodes because of the greedy scheduling of the first wave of tasks, particularly if those tasks have
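
To make the distinction concrete, a sketch of what HashPartitioner actually decides; this mirrors the non-negative-modulo logic in Spark's HashPartitioner, and decides the partition index only, not placement:

    // Key -> partition index is deterministic:
    def partitionOf(key: Any, numPartitions: Int): Int = {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
    partitionOf(42, 8)  // same key, same index, on every run
    // ...but partition -> node placement is decided by the scheduler at
    // runtime, so the same partitions can land on different workers.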

Fwd: Adding metadata information to parquet files

2016-04-17 Thread Manivannan Selvadurai
Just a reminder! Hi All, I'm trying to ingest data from Kafka as Parquet files. I use Spark 1.5.2 and I'm looking for a way to store the source schema in the Parquet file, the way you get to store the Avro schema as metadata when using the AvroParquetWriter. Any help much
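
No direct answer appears in this thread; one workaround sketch (an assumption on my part, not Spark's documented equivalent of AvroParquetWriter's file-level metadata) is to attach information to a column's metadata, which Spark serializes into the schema it writes to the Parquet footer. Given an existing DataFrame df with a column "value" (names hypothetical):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.MetadataBuilder

    val meta = new MetadataBuilder()
      .putString("sourceSchema", df.schema.json)  // stash the source schema as JSON
      .build()
    // Re-alias the column to carry the metadata, then write as Parquet
    val tagged = df.withColumn("value", col("value").as("value", meta))
    tagged.write.parquet("/data/out")  // hypothetical output path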

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
A few params like spark.task.cpus and spark.cores.max will help. Also, for 52MB of data you need not have 12GB allocated to the executors. Better to assign 512MB or so and increase the number of executors per worker node. Try reducing the executor memory to 512MB or so for this case. On Mon, Apr 18,
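
A sketch of the suggested settings as they might look in a SparkConf; the values are illustrative, beyond the 512MB figure from the thread:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "512m")  // small heap suffices for 52MB of data
      .set("spark.cores.max", "8")           // illustrative total-core cap
      .set("spark.task.cpus", "1")           // CPUs reserved per task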

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Mike Hynes
When submitting a job with spark-submit, I've observed delays (up to 1-2 seconds) for the executors to respond to the driver in order to receive tasks in the first stage. The delay does not persist once the executors have been synchronized. When the tasks are very short, as may be your case

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Anuj Kumar
Good point, Mike. +1 On Mon, Apr 18, 2016 at 9:47 AM, Mike Hynes <91m...@gmail.com> wrote: > When submitting a job with spark-submit, I've observed delays (up to > 1-2 seconds) for the executors to respond to the driver in order to > receive tasks in the first stage. The delay does not persist

Re: strange HashPartitioner behavior in Spark

2016-04-17 Thread Raghava Mutharaju
Yes, it's the same data. 1) The number of partitions is the same (8, which is the argument to the HashPartitioner). In the first case, these partitions are spread across both worker nodes. In the second case, all the partitions are on the same node. 2) What resources would be of interest here?