Re: Re: spark2.1 kafka0.10

2017-06-21 Thread lk_spark
Each topic has 5 partitions, 2 replicas. 2017-06-22 lk_spark From: Pralabh Kumar Sent: 2017-06-22 17:23 Subject: Re: spark2.1 kafka0.10 To: "lk_spark" Cc: "user.spark" How many replicas do you have for this topic? On Thu, Jun

Re: Using YARN w/o HDFS

2017-06-21 Thread Chen He
Change your fs.defaultFS to point to the local file system and have a try. On Wed, Jun 21, 2017 at 4:50 PM, Alaa Zubaidi (PDF) wrote: > Hi, > > Can we run Spark on YARN without installing HDFS? > If yes, where would HADOOP_CONF_DIR point to? > > Regards, > > *This message may

Unsubscribe

2017-06-21 Thread Anita Tailor
Sent from my iPhone - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: spark2.1 kafka0.10

2017-06-21 Thread Pralabh Kumar
How many replicas do you have for this topic? On Thu, Jun 22, 2017 at 9:19 AM, lk_spark wrote: > java.lang.IllegalStateException: No current assignment for partition > pages-2 > at org.apache.kafka.clients.consumer.internals.SubscriptionState. >

Re: spark2.1 kafka0.10

2017-06-21 Thread lk_spark
java.lang.IllegalStateException: No current assignment for partition pages-2 at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:264) at org.apache.kafka.clients.consumer.internals.SubscriptionState.needOffsetReset(SubscriptionState.java:336)

spark2.1 kafka0.10

2017-06-21 Thread lk_spark
hi,all: when I run stream application for a few minutes ,I got this error : 17/06/22 10:34:56 INFO ConsumerCoordinator: Revoking previously assigned partitions [comment-0, profile-1, profile-3, cwb-3, bizs-1, cwb-1, weibocomment-0, bizs-2, pages-0, bizs-4, pages-2, weibo-0, pages-4, weibo-4,
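For reference, a minimal sketch of how such a direct stream is typically created with the Kafka 0.10 integration; the broker address, group id, and topic list below are placeholders rather than values from the original post:

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

  val ssc = new StreamingContext(sc, Seconds(10))
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",                  // placeholder broker list
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "stream-app",                             // placeholder consumer group
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )
  val topics = Array("pages", "weibo", "comment")           // placeholder topic names
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))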

Re: Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-21 Thread Kanagha Kumar
The problem I see is that the properties defined in the scala-2.10 profile are not getting picked up by the submodules while doing a Maven release (Maven 3.3.9). It works correctly while doing mvn package, though. I also changed the pom.xml default properties to the 2.10 Scala versions and tried maven

Using YARN w/o HDFS

2017-06-21 Thread Alaa Zubaidi (PDF)
Hi, Can we run Spark on YARN without installing HDFS? If yes, where would HADOOP_CONF_DIR point to? Regards, -- *This message may contain confidential and privileged information. If it has been sent to you in error, please reply to advise the sender of the error and then immediately

Re: Broadcasts & Storage Memory

2017-06-21 Thread Bryan Jeffrey
Satish, I agree - that was my impression too. However, I am seeing less storage memory used on a given executor than the amount of memory required for my broadcast variables. I am wondering if the statistics in the UI are incorrect or if the broadcasts are simply not a part of

Re: Broadcasts & Storage Memory

2017-06-21 Thread satish lalam
My understanding is that it comes from storageFraction. Here cached blocks are immune to eviction - so both persisted RDDs and broadcast variables sit here. Ref

Broadcasts & Storage Memory

2017-06-21 Thread Bryan Jeffrey
Hello. Question: Do broadcast variables stored on executors count as part of 'storage memory' or other memory? A little bit more detail: I understand that we have two knobs to control memory allocation: - spark.memory.fraction - spark.memory.storageFraction My understanding is that
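For reference, a minimal sketch of where these two knobs are set; the values shown are simply Spark's documented defaults, not a recommendation:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("broadcast-memory-demo")           // placeholder app name
    .set("spark.memory.fraction", "0.6")           // share of (heap - 300MB) usable for execution + storage
    .set("spark.memory.storageFraction", "0.5")    // portion of that share immune to eviction by execution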

Unsubscribe

2017-06-21 Thread Tao Lu
Unsubscribe

Re: "Sharing" dataframes...

2017-06-21 Thread Pierce Lamb
Hi Jean, Since many in this thread have mentioned datastores from what I would call the "Spark datastore ecosystem", I thought I would link you to a StackOverflow answer I posted a while back that tried to capture the majority of this ecosystem. Most would claim to allow you to do something like

Re: Do we anything for Deep Learning in Spark?

2017-06-21 Thread Suzen, Mehmet
There is a BigDL project: https://github.com/intel-analytics/BigDL On 20 June 2017 at 16:17, Jules Damji wrote: > And we will having a webinar on July 27 going into some more details. Stay > tuned. > > Cheers > Jules > > Sent from my iPhone > Pardon the dumb thumb typos :)

Re: "Sharing" dataframes...

2017-06-21 Thread Gene Pang
Hi Jean, As others have mentioned, you can use Alluxio with Spark dataframes to keep the data in memory, and for other jobs to read them from memory again. Hope this helps, Gene On Wed, Jun 21, 2017 at 8:08 AM, Jean Georges
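Since Alluxio exposes a Hadoop-compatible file system, sharing a DataFrame between jobs is just a matter of pointing the read and write paths at it. A minimal sketch, assuming an Alluxio master at alluxio://master:19998 (host, port, and paths are placeholders):

  // Job A writes the DataFrame into Alluxio-managed memory
  df.write.parquet("alluxio://master:19998/shared/my_df")

  // Job B, a separate Spark application, reads it back
  val shared = spark.read.parquet("alluxio://master:19998/shared/my_df")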

Re: "Sharing" dataframes...

2017-06-21 Thread Jean Georges Perrin
I have looked at Livy in the (very recent) past and it will not do the trick for me. It seems pretty greedy in terms of resources (or at least that was our experience). I will investigate how job-server could do the trick. (on a side note I tried to find a paper on memory lifecycle within

Re: "Sharing" dataframes...

2017-06-21 Thread Michael Mior
This is a puzzling suggestion to me. It's unclear what features the OP needs, so it's really hard to say whether Livy or job-server aren't sufficient. It's true that neither are particularly mature, but they're much more mature than a homemade project which hasn't started yet. That said, I'm not

Re: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Aviral Agarwal
This works. Thanks! - Aviral Agarwal On Wed, Jun 21, 2017 at 6:07 PM, Eduardo Mello wrote: > You can add "?zeroDateTimeBehavior=convertToNull" to the connection > string. > > On Wed, Jun 21, 2017 at 9:04 AM, Aviral Agarwal > wrote: > >> The

Re: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Eduardo Mello
You can add "?zeroDateTimeBehavior=convertToNull" to the connection string. On Wed, Jun 21, 2017 at 9:04 AM, Aviral Agarwal wrote: > The exception is happening in JDBC RDD code where getNext() is called to > get the next row. > I do not have access to the result set. I am
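A minimal sketch of that fix when reading through the DataFrame JDBC source; the host, database, table, and credentials are placeholders:

  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb?zeroDateTimeBehavior=convertToNull")
    .option("dbtable", "events")        // placeholder table name
    .option("user", "spark")            // placeholder credentials
    .option("password", "secret")
    .load()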

RE: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Aviral Agarwal
The exception is happening in JDBC RDD code where getNext() is called to get the next row. I do not have access to the result set. I am operating on a DataFrame. Thanks and Regards, Aviral Agarwal On Jun 21, 2017 17:19, "Mahesh Sawaiker" wrote: > This has to do

RE: JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Mahesh Sawaiker
This has to do with how you are creating the timestamp object from the result set (I guess). If you can provide more code it will help, but you could surround the parsing code with a try/catch and then just ignore the exception. From: Aviral Agarwal [mailto:aviral12...@gmail.com] Sent:
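For the try/catch suggestion, a hedged sketch of skipping unparsable values instead of failing the job; the column index and the name `rows` are assumptions for illustration:

  import java.sql.Timestamp
  import scala.util.Try

  // Returns None for values that cannot be represented as a Timestamp,
  // so bad records can be dropped rather than aborting the job.
  def safeTimestamp(raw: String): Option[Timestamp] =
    Try(Timestamp.valueOf(raw)).toOption

  val parsed = rows.flatMap(r => safeTimestamp(r.getString(0)))  // assumes column 0 holds the raw value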

RE: Using Spark as a simulator

2017-06-21 Thread Mahesh Sawaiker
Spark can help you to create one large file if needed, but HDFS itself will provide abstraction over such things, so it's a trivial problem if anything. If you have Spark installed, then you can use spark-shell to try a few samples and build from there. If you can collect all the files in a

gfortran runtime library for Spark

2017-06-21 Thread Saroj C
Dear All, Can you please let me know if the gfortran runtime library is still required for Spark 2.1 for better performance? Note, I am using the Java APIs for Spark. Thanks & Regards Saroj =-=-= Notice: The information contained in this e-mail message and/or attachments to it

JDBC RDD Timestamp Parsing Issue

2017-06-21 Thread Aviral Agarwal
Hi, I am using a JDBC RDD to read from a MySQL RDBMS. My Spark job fails with the below error: java.sql.SQLException: Value '0000-00-00 00:00:00.000' can not be represented as java.sql.Timestamp Now, instead of the whole job failing, I want to skip this record and continue processing the rest.

Saving RDD as Kryo (broken in 2.1)

2017-06-21 Thread Alexander Krasheninnikov
Hi, all! I have code that serializes an RDD as Kryo and saves it as a sequence file. It works fine in 1.5.1, but after switching to 2.1.1 it does not work. I am trying to serialize an RDD of Tuple2<> (obtained from a PairRDD). 1. The RDD consists of heterogeneous objects (aggregates, like HLL,
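One common way to do this by hand is to Kryo-serialize each record into bytes and store them as a SequenceFile of BytesWritable. The sketch below illustrates that pattern only; it is not the poster's code and does not address whatever registration behavior changed between 1.5.1 and 2.1.1:

  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.spark.rdd.RDD
  import org.apache.spark.serializer.KryoSerializer
  import scala.reflect.ClassTag

  def saveAsKryoSequenceFile[T: ClassTag](rdd: RDD[T], path: String): Unit = {
    val conf = rdd.sparkContext.getConf     // SparkConf is serializable; capture it outside the closure
    rdd.mapPartitions { iter =>
      val ser = new KryoSerializer(conf).newInstance()
      iter.map { elem =>
        val buf = ser.serialize(elem)
        val bytes = new Array[Byte](buf.remaining())
        buf.get(bytes)
        (NullWritable.get(), new BytesWritable(bytes))
      }
    }.saveAsSequenceFile(path)
  }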

VS: Using Spark as a simulator

2017-06-21 Thread Esa Heikkinen
Hi, Thanks for the answer. I think my simulator includes a lot of parallel state machines, and each of them generates a log file (with timestamps). Finally, all events (rows) of all the log files should be combined in time order into (one) very huge log file. In practice, the combined huge log file can
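As a rough illustration of the combining step in Spark, assuming each log line starts with a lexically sortable (e.g. ISO-8601) timestamp followed by a space; the paths and format are placeholders:

  val logs = sc.textFile("hdfs:///simulator/logs/*")        // all per-state-machine log files
  val combined = logs
    .map(line => (line.takeWhile(_ != ' '), line))          // key each row by its leading timestamp
    .sortByKey()                                            // ISO-8601 strings sort correctly as text
    .values
  combined.saveAsTextFile("hdfs:///simulator/combined")

Note the output is one part-file per partition; coalescing to a single partition before saving would yield one huge file at the cost of parallelism.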

Re: "Sharing" dataframes...

2017-06-21 Thread Rick Moritz
Keeping it inside the same program/SparkContext is the most performant solution, since you can avoid serialization and deserialization. In-memory persistence between jobs involves a memory copy, uses a lot of RAM, and invokes serialization and deserialization. Technologies that can help you do that

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-21 Thread N B
Hadoop version 2.7.3 On Tue, Jun 20, 2017 at 11:12 PM, yohann jardin wrote: > Which version of Hadoop are you running on? > > *Yohann Jardin* > On 6/21/2017 at 1:06 AM, N B wrote: > > Ok, some more info about this issue to see if someone can shine a light on > what

Re: Spark 2.1.1 and Hadoop version 2.2 or 2.7?

2017-06-21 Thread yohann jardin
https://spark.apache.org/docs/2.1.0/building-spark.html#specifying-the-hadoop-version Hadoop v2.2.0 is only the default build version; other versions can still be built. The package you downloaded is prebuilt for Hadoop 2.7, as said on the download page, so don't worry. Yohann Jardin

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-21 Thread yohann jardin
Which version of Hadoop are you running on? Yohann Jardin On 6/21/2017 at 1:06 AM, N B wrote: Ok, some more info about this issue to see if someone can shine a light on what could be going on. I turned on debug logging for org.apache.spark.streaming.scheduler in the driver process and this is

RE: Merging multiple Pandas dataframes

2017-06-21 Thread Mendelson, Assaf
If you do an action, most intermediate calculations would be gone for the next iteration. What I would do is persist on every iteration, then after some number of iterations (say 5) write to disk and reload. At that point you should call unpersist to free the memory, as it is no longer relevant. Thanks,
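A hedged sketch of that persist/write/reload loop for Spark DataFrames; the step function, paths, and iteration count are placeholders (adapt accordingly if the data really lives in Pandas):

  var df = spark.read.parquet("/data/initial")               // placeholder starting dataset
  for (i <- 1 to 20) {
    val next = step(df).persist()                            // `step` stands in for the per-iteration logic
    next.count()                                             // action to materialize the persisted result
    df.unpersist()                                           // the previous iteration is no longer needed
    df = next
    if (i % 5 == 0) {                                        // every few iterations, cut the lineage
      df.write.mode("overwrite").parquet(s"/tmp/iter_$i")
      df.unpersist()
      df = spark.read.parquet(s"/tmp/iter_$i")
    }
  }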