Re: Flume DStream produces 0 records after HDFS node killed

2017-06-20 Thread N B
Ok, some more info about this issue to see if someone can shed some light on what could be going on. I turned on debug logging for org.apache.spark.streaming.scheduler in the driver process, and this is what gets thrown in the logs; it keeps getting thrown even after the downed HDFS node is restarted.
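
For reference, a minimal sketch of how that debug logging can be enabled, either programmatically against the log4j 1.x bundled with Spark 2.x or via a log4j.properties entry (shown in a comment); nothing here is specific to the original application.

    import org.apache.log4j.{Level, Logger}

    // Raise only the streaming scheduler package to DEBUG so the rest of the
    // driver logs stay at their configured level.
    Logger.getLogger("org.apache.spark.streaming.scheduler").setLevel(Level.DEBUG)

    // Equivalent line in conf/log4j.properties:
    // log4j.logger.org.apache.spark.streaming.scheduler=DEBUG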

Re: appendix

2017-06-20 Thread Wenchen Fan
You should make HBase a data source (it seems we already have an HBase connector?), create a DataFrame from HBase, and do the join in Spark SQL. > On 21 Jun 2017, at 10:17 AM, sunerhan1...@sina.com wrote: > > Hello, > My scenario is like this: > 1.val df=hivecontext/carboncontex.sql("sql") >
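
A rough sketch of that suggestion, assuming the SHC HBase connector is on the classpath; the catalog, table, and column names below are made up for illustration.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hbase-join").enableHiveSupport().getOrCreate()

    // Hypothetical SHC catalog for an HBase table "events" with one column family "cf".
    val catalog =
      """{
        |  "table": {"namespace": "default", "name": "events"},
        |  "rowkey": "key",
        |  "columns": {
        |    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
        |    "value": {"cf": "cf",     "col": "value", "type": "string"}
        |  }
        |}""".stripMargin

    // DataFrame backed by HBase through the connector's data source.
    val hbaseDF = spark.read
      .option("catalog", catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    // DataFrame produced by the Hive/Carbon SQL step, then a plain Spark SQL join.
    val hiveDF = spark.sql("SELECT id, metric FROM some_hive_table")
    val joined = hiveDF.join(hbaseDF, Seq("id"))
    joined.show()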

Re: org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
After investigation, it looks like my Spark 2.1.1 jars got corrupted during download - all good now... ;) > On Jun 20, 2017, at 4:14 PM, Jean Georges Perrin wrote: > > Hey all, > > i was giving a run to 2.1.1 and got an error on one of my test program: > > package

Re: Cassandra querying time stamps

2017-06-20 Thread Riccardo Ferrari
Hi, personally I would inspect how dates are managed. What does your Spark code look like? What does the explain plan say? Does TimeStamp get parsed the same way? Best, On Tue, Jun 20, 2017 at 12:52 PM, sujeet jog wrote: > Hello, > > I have a table as below > > CREATE TABLE

Cassandra querying time stamps

2017-06-20 Thread sujeet jog
Hello, I have a table as below CREATE TABLE analytics_db.ml_forecast_tbl ( "MetricID" int, "TimeStamp" timestamp, "ResourceID" timeuuid, "Value" double, PRIMARY KEY ("MetricID", "TimeStamp", "ResourceID") ) select * from ml_forecast_tbl where "MetricID" = 1 and "TimeStamp" >

Re: Cassandra querying time stamps

2017-06-20 Thread sujeet jog
Below is the query; judging from the physical plan, it is the same as the one run in cqlsh. val query = s"""(select * from model_data where TimeStamp > \'$timeStamp+\' and TimeStamp <= \'$startTS+\' and MetricID = $metricID)""" println("Model query" + query) val df
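
One way to rule out timestamp parsing/timezone issues is to filter on typed values rather than interpolated strings; a sketch using the spark-cassandra-connector DataFrame API, where the keyspace and table follow the earlier CREATE TABLE and the bounds are placeholders:

    import java.sql.Timestamp
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("cassandra-ts").getOrCreate()

    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "analytics_db", "table" -> "ml_forecast_tbl"))
      .load()

    // Typed Timestamp values instead of interpolated strings: no timezone/format
    // ambiguity, and the connector can push the predicates down to Cassandra.
    val startTS = Timestamp.valueOf("2017-06-19 00:00:00")
    val endTS   = Timestamp.valueOf("2017-06-20 00:00:00")

    val result = df
      .filter(col("MetricID") === 1)
      .filter(col("TimeStamp") > startTS && col("TimeStamp") <= endTS)

    result.explain()   // the filters should show up as pushed filters in the physical plan
    result.show()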

Re: Spark Streaming - Increasing number of executors slows down processing rate

2017-06-20 Thread Biplob Biswas
Hi Edwin, I have faced a similar issue as well, and this behaviour is very abrupt. I even created a question on Stack Overflow, but there is no solution yet. https://stackoverflow.com/questions/43496205/spark-job-processing-time-increases-to-4s-without-explanation For us, we sometimes had this constant

Spark 2.1.1 and Hadoop version 2.2 or 2.7?

2017-06-20 Thread N B
I had downloaded the pre-built package labeled "Spark 2.1.1 prebuilt with Hadoop 2.7 or later" from the direct download link on spark.apache.org. However, I am seeing compatibility errors running against a deployed HDFS 2.7.3. (See my earlier message about Flume DStream producing 0 records after
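
A quick way to check which Hadoop client classes the driver actually loaded (standard Hadoop/Spark calls, shown only as a sketch):

    import org.apache.hadoop.util.VersionInfo
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hadoop-version-check").getOrCreate()

    // Version of the Hadoop client bundled with (or placed ahead of) the Spark build.
    println("Hadoop client version: " + VersionInfo.getVersion)
    println("Spark version: " + spark.version)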

RE: Using Spark as a simulator

2017-06-20 Thread Mahesh Sawaiker
I have already seen one example where data is generated using Spark, so there is no reason to think it's a bad idea as far as I know. You can check the code here; I'm not completely sure, but I think there is something there that generates data for the TPC-DS benchmark, and you can specify how much data you want in

RE: Merging multiple Pandas dataframes

2017-06-20 Thread Mendelson, Assaf
Note that depending on the number of iterations, the query plan for the dataframe can become long and this can cause slowdowns (or even crashes). A possible solution would be to checkpoint (or simply save and reload the dataframe) every once in a while. When reloading from disk, the newly loaded
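
A minimal sketch of the checkpoint / save-and-reload idea for an iterative union; the batch count, loading helper, and paths are placeholders:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("iterative-union").getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")              // placeholder

    def loadBatch(i: Int): DataFrame = spark.read.parquet(s"/data/batch_$i")   // placeholder

    var merged: DataFrame = loadBatch(0)
    for (i <- 1 until 100) {
      merged = merged.union(loadBatch(i))
      // Every few iterations, cut the lineage so the query plan stops growing.
      if (i % 10 == 0) {
        merged = merged.checkpoint()   // eager checkpoint, available since Spark 2.1
        // Alternative: write to parquet and read it back, which has the same effect.
      }
    }
    merged.count()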

spark2.1 and kafka0.10

2017-06-20 Thread lk_spark
Hi all: https://issues.apache.org/jira/browse/SPARK-19680 Is there any way to patch this issue? I have hit the same problem. 2017-06-20 lk_spark

Using Spark as a simulator

2017-06-20 Thread Esa Heikkinen
Hi, Spark is a data analyzer, but would it be possible to use Spark as a data generator or simulator? My simulation can be very large, and I think a parallelized simulation using Spark (in the cloud) could work. Is that a good or bad idea? Regards, Esa Heikkinen

Re: Cassandra querying time stamps

2017-06-20 Thread sujeet jog
Correction. On Tue, Jun 20, 2017 at 5:27 PM, sujeet jog wrote: > , Below is the query, looks like from physical plan, the query is same as > that of cqlsh, > > val query = s"""(select * from model_data > where TimeStamp > \'$timeStamp+\' and TimeStamp <= >

spark higher order functions

2017-06-20 Thread AssafMendelson
Hi, I have seen that Databricks have higher order functions (https://docs.databricks.com/_static/notebooks/higher-order-functions.html, https://databricks.com/blog/2017/05/24/working-with-nested-data-using-higher-order-functions-in-sql-on-databricks.html) which basically allow you to do generic
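
For comparison, in open-source Spark 2.x the usual workaround is a Scala UDF over the array column; a small sketch with made-up data:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("hof-workaround").getOrCreate()
    import spark.implicits._

    // Example data: one array column named "values".
    val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("values")

    // Rough equivalent of TRANSFORM(values, x -> x + 1) as a Scala UDF.
    val addOne = udf((xs: Seq[Int]) => xs.map(_ + 1))

    df.select(addOne($"values").as("values_plus_one")).show()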

Re: Do we anything for Deep Learning in Spark?

2017-06-20 Thread Michael Mior
It's still in the early stages, but check out Deep Learning Pipelines from Databricks https://github.com/databricks/spark-deep-learning -- Michael Mior mm...@apache.org 2017-06-20 0:36 GMT-04:00 Gaurav1809 : > Hi All, > > Similar to how we have machine learning library

Re: Using Spark as a simulator

2017-06-20 Thread Jörn Franke
It is fine, but you have to design it so that the generated rows are written in large blocks for optimal performance. The trickiest part of data generation is the conceptual part, such as the probabilistic distributions, etc. You also have to check that you use a good random number generator; for some cases
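
A sketch of what that can look like in practice: per-partition seeded generators for reproducibility, and a modest number of partitions so the output lands in large files (row count, columns, and path are placeholders):

    import scala.util.Random
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("data-generator").getOrCreate()
    import spark.implicits._

    val rows = 1000000000L   // total rows to simulate (placeholder)
    val partitions = 200     // controls parallelism and output file sizes

    val simulated = spark.range(0, rows, 1, partitions)
      .rdd
      .mapPartitionsWithIndex { (pid, ids) =>
        // One explicitly seeded generator per partition: reproducible runs, and
        // no two tasks emit the same pseudo-random sequence.
        val rng = new Random(pid)
        ids.map(id => (id.longValue, rng.nextGaussian(), rng.nextDouble()))
      }
      .toDF("id", "gaussian", "uniform")

    // Few, large output files are friendlier to HDFS than many tiny ones.
    simulated.write.parquet("/data/simulated")   // placeholder path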

Re: Do we anything for Deep Learning in Spark?

2017-06-20 Thread Jules Damji
And we will be having a webinar on July 27 going into some more detail. Stay tuned. Cheers Jules Sent from my iPhone Pardon the dumb thumb typos :) > On Jun 20, 2017, at 7:00 AM, Michael Mior wrote: > > It's still in the early stages, but check out Deep Learning

Re: org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Michael Armbrust
It's in the spark-catalyst_2.11-2.1.1.jar since the logical query plans and optimization also need to know about types. On Tue, Jun 20, 2017 at 1:14 PM, Jean Georges Perrin wrote: > Hey all, > > i was giving a run to 2.1.1 and got an error on one of my test program: > > package
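
For reference, a minimal sketch showing that the types resolve as long as spark-sql (which pulls in spark-catalyst transitively) is intact on the classpath; the sbt line is shown as a comment:

    // build.sbt: libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1"
    // spark-catalyst, where org.apache.spark.sql.types lives, comes in transitively.

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // If the jars are intact, this compiles and prints the schema tree.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))
    println(schema.treeString)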

Re: Bizarre diff in behavior between scala REPL and sparkSQL UDF

2017-06-20 Thread jeff saremi
Never mind! I had a space at the end of my data which was not showing up in manual testing. Thanks From: jeff saremi Sent: Tuesday, June 20, 2017 2:48:06 PM To: user@spark.apache.org Subject: Bizarre diff in behavior between scala REPL

org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
Hey all, I was giving 2.1.1 a run and got an error in one of my test programs: package net.jgp.labs.spark.l000_ingestion; import java.util.Arrays; import java.util.List; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SparkSession; import

Bizarre diff in behavior between scala REPL and sparkSQL UDF

2017-06-20 Thread jeff saremi
I have this function which does regex matching in Scala. When I test it in the REPL, I get the expected results. When I use it as a UDF in Spark SQL, I get completely incorrect results. Function: class UrlFilter (filters: Seq[String]) extends Serializable { val regexFilters = filters.map(new Regex(_))
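
A sketch of how a class like the one in the message can be registered for Spark SQL, with one extra trim that turned out to matter (see the "never mind" follow-up about a trailing space in the data); the sample pattern is made up:

    import scala.util.matching.Regex
    import org.apache.spark.sql.SparkSession

    class UrlFilter(filters: Seq[String]) extends Serializable {
      val regexFilters: Seq[Regex] = filters.map(new Regex(_))
      // Trim before matching: data read into a DataFrame can carry trailing
      // whitespace that is invisible in REPL tests but breaks anchored patterns.
      def matches(url: String): Boolean = {
        val clean = Option(url).map(_.trim).getOrElse("")
        regexFilters.exists(_.pattern.matcher(clean).matches())
      }
    }

    val spark = SparkSession.builder().appName("url-filter").getOrCreate()
    val filter = new UrlFilter(Seq("https?://.*\\.example\\.com/.*"))
    spark.udf.register("urlMatches", (url: String) => filter.matches(url))
    // Usable from Spark SQL, e.g.:  SELECT * FROM urls WHERE urlMatches(url)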

Re: "Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Thanks Vadim & Jörn... I will look into those. jg > On Jun 20, 2017, at 2:12 PM, Vadim Semenov > wrote: > > You can launch one permanent spark context and then execute your jobs within > the context. And since they'll be running in the same context, they can

How to bootstrap Spark Kafka direct with the previous state in case of a code upgrade

2017-06-20 Thread SRK
Hi, How do we bootstrap the streaming job with the previous state when we do a code change and redeploy? We use updateStateByKey to maintain the state and store session objects and LinkedHashMaps in the checkpoint. Thanks, Swetha -- View this message in context:
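
Checkpoints generally cannot be reloaded after a code change, so one common pattern is to persist the final state externally and feed it back through the initialRDD overload of updateStateByKey. A rough sketch with placeholder types, paths, and input stream:

    import org.apache.spark.{HashPartitioner, SparkConf}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    val conf = new SparkConf().setAppName("stateful-bootstrap")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint("/checkpoints/app")                       // placeholder

    // State that was dumped to storage before the redeploy (placeholder path/format).
    val initialState = ssc.sparkContext.objectFile[(String, Long)]("/state/latest")

    def updateFunc(newValues: Seq[Long], state: Option[Long]): Option[Long] =
      Some(newValues.sum + state.getOrElse(0L))

    // Placeholder for the parsed (key, count) stream, e.g. built from Kafka.
    val pairs: DStream[(String, Long)] = ???

    val stateStream = pairs.updateStateByKey(
      updateFunc _,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialState)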

"Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Hey, Here is my need: program A does something on a set of data and produces results, program B does that on another set, and finally, program C combines the data of A and B. Of course, the easy way is to dump all on disk after A and B are done, but I wanted to avoid this. I was thinking of

Re: Merging multiple Pandas dataframes

2017-06-20 Thread Saatvik Shah
Hi Assaf, Thanks for the suggestion on checkpointing - I'll need to read up more on that. My current implementation seems to be crashing with a GC memory limit exceeded error if I'm keeping multiple persist calls for a large number of files. Thus, I was also thinking about the constant calls to

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-20 Thread N B
BTW, this is running on Spark 2.1.1. I have been trying to debug this issue, and what I have found so far is that it is somehow related to the Spark WAL. The directory named /receivedBlockMetadata seems to stop getting written to after an HDFS node is killed and restarted. I have
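
For context, a sketch of the configuration under which that directory is used: with the receiver write-ahead log enabled, block metadata is logged under the checkpoint directory's receivedBlockMetadata subdirectory on HDFS (application name, batch interval, and paths below are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("flume-stream")
      // Route received blocks through the HDFS-backed write-ahead log.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // Block metadata ends up under <this directory>/receivedBlockMetadata.
    ssc.checkpoint("hdfs:///checkpoints/flume-app")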

Re: "Sharing" dataframes...

2017-06-20 Thread Jörn Franke
You could express it all in one program; alternatively, use the Ignite in-memory file system or the Ignite shared RDD (not sure if DataFrame is supported). > On 20. Jun 2017, at 19:46, Jean Georges Perrin wrote: > > Hey, > > Here is my need: program A does something on a set of data and

Re: "Sharing" dataframes...

2017-06-20 Thread Vadim Semenov
You can launch one permanent Spark context and then execute your jobs within that context. And since they'll be running in the same context, they can share data easily. These two projects provide the functionality that you need:
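
Independent of those projects, a minimal sketch of the single long-lived context idea itself: jobs A and B register their results as temp views in the shared session, and C joins them without touching disk (paths and column names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("shared-context").getOrCreate()

    // Job A: produce a result and expose it to other jobs in the same session.
    val resultA = spark.read.parquet("/data/input_a").filter("value > 0")
    resultA.createOrReplaceTempView("result_a")

    // Job B: same idea on its own input.
    val resultB = spark.read.parquet("/data/input_b").select("key", "score")
    resultB.createOrReplaceTempView("result_b")

    // Job C: combine A and B without writing any intermediate results to disk.
    val combined = spark.sql(
      "SELECT a.key, a.value, b.score FROM result_a a JOIN result_b b ON a.key = b.key")
    combined.show()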