Re: Converting dataframe to dataset question

2017-03-23 Thread Ryan
You should import either spark.implicits or sqlContext.implicits, not both; otherwise the compiler will be confused by the two implicit conversions. The following code works for me, Spark version 2.1.0: object Test { def main(args: Array[String]) { val spark = SparkSession .builder
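A minimal sketch of how the rest of that example presumably continues (the case class, file path, and column names are assumptions, not from the original mail):

    import org.apache.spark.sql.SparkSession

    object Test {
      // hypothetical case class matching the DataFrame's columns
      case class Person(name: String, age: Long)

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("df-to-ds").master("local[*]").getOrCreate()
        // import only one set of implicits to avoid ambiguous conversions
        import spark.implicits._
        val df = spark.read.json("people.json")   // hypothetical input file
        val ds = df.as[Person]                    // DataFrame -> Dataset[Person]
        ds.show()
        spark.stop()
      }
    }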

Re: Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Ryan
No, you don't need one-hot encoding. But since the feature column is a vector and vectors only accept numbers, a StringIndexer is needed if your feature is a string. Here's an example: http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier On Thu, Mar 23, 2017 at
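A hedged sketch of the StringIndexer + VectorAssembler step (the column names and tiny example DataFrame are made up; assumes an existing SparkSession named spark):

    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import spark.implicits._

    val df = Seq(("red", 1.0, 0.0), ("blue", 2.0, 1.0)).toDF("color", "size", "label")

    // index the string feature, then assemble the numeric columns into a vector
    val indexer = new StringIndexer().setInputCol("color").setOutputCol("colorIndex")
    val assembler = new VectorAssembler()
      .setInputCols(Array("colorIndex", "size"))
      .setOutputCol("features")

    val indexed = indexer.fit(df).transform(df)
    val withFeatures = assembler.transform(indexed)
    // withFeatures can now be fed to RandomForestClassifier without one-hot encoding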

Re: Best way to deal with skewed partition sizes

2017-03-22 Thread Ryan
Could you share the event timeline and DAG for the time-consuming stages from the Spark UI? On Thu, Mar 23, 2017 at 4:30 AM, Matt Deaver wrote: > For various reasons, our data set is partitioned in Spark by customer id > and saved to S3. When trying to read this data, however,

Re: Why VectorUDT private?

2017-03-29 Thread Ryan
Spark version 2.1.0; the vector is from the ml package. The Vector in mllib has a public VectorUDT type. On Thu, Mar 30, 2017 at 10:57 AM, Ryan <ryan.hd@gmail.com> wrote: > I'm writing a transformer and the input column is vector type(which is the > output column from other

Why VectorUDT private?

2017-03-29 Thread Ryan
I'm writing a transformer whose input column is of vector type (it is the output column from another transformer). But as VectorUDT is private, how can I check/transform the schema for the vector column?
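One workaround that I believe exists (worth double-checking on your Spark version) is to compare against the public ml.linalg.SQLDataTypes.VectorType instead of referencing VectorUDT directly; a hedged sketch of a transformSchema-style check:

    import org.apache.spark.ml.linalg.SQLDataTypes
    import org.apache.spark.sql.types.StructType

    // hypothetical helper: verify the input column is a (new) ml vector column
    def validateVectorCol(schema: StructType, inputCol: String): Unit = {
      val dt = schema(inputCol).dataType
      require(dt == SQLDataTypes.VectorType,
        s"Column $inputCol must be of vector type but was $dt")
    }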

Re: Groupby is faster in Impala than Spark SQL - any suggestions

2017-03-28 Thread Ryan
And could you paste the stage and task information from the Spark UI? On Wed, Mar 29, 2017 at 11:30 AM, Ryan <ryan.hd@gmail.com> wrote: > how long does it take if you remove the repartition and just collect the > result? I don't think repartition is needed here. There's alrea

Re: Groupby is faster in Impala than Spark SQL - any suggestions

2017-03-28 Thread Ryan
How long does it take if you remove the repartition and just collect the result? I don't think repartition is needed here; there's already a shuffle for the group by. On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am working on requirement where i

Re: Foreachpartition in spark streaming

2017-03-20 Thread Ryan
foreachPartition is an action but runs on each worker, which means you won't see any output on the driver. mapPartitions is a transformation, which is lazy and won't do anything until an action. Which is better depends on the specific use case. To output something (like a print on a single machine) you could
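A small sketch contrasting the two in a streaming job (the dstream and the processing are placeholders, not from the original thread):

    // mapPartitions is a lazy transformation: nothing runs until an action
    val parsed = dstream.mapPartitions { iter =>
      iter.map(line => line.toUpperCase)          // per-partition transformation
    }

    // foreachPartition is an action that runs on the workers, so any println
    // here shows up in the executor logs, not on the driver
    parsed.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        iter.foreach(record => println(record))   // executor-side output
      }
    }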

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Ryan
You could write a UDF using the asML method along with some type casting, then apply the UDF to the data after PCA. When using a pipeline, that UDF needs to be wrapped in a custom transformer, I think. On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath wrote: > Why not use
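A hedged sketch of that UDF (the input DataFrame and column names are assumptions):

    import org.apache.spark.mllib.linalg.{Vector => OldVector}
    import org.apache.spark.sql.functions.{col, udf}

    // convert an old mllib Vector column into a new ml Vector column
    val toML = udf { v: OldVector => v.asML }
    val converted = pcaOutput.withColumn("featuresML", toML(col("pcaFeatures")))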

Re: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
1. A row group wouldn't be read if the predicate isn't satisfied, thanks to the index. 2. It is absolutely true that the performance gain depends on the id distribution... On Mon, Apr 17, 2017 at 4:23 PM, 莫涛 <mo...@sensetime.com> wrote: > Hi Ryan, > > > The attachment is a screen shot for the sp

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
You can build a search tree from the ids within each partition to act like an index, or create a bloom filter to see whether the current partition would have any hit. What are your expected QPS and response time for the filter request? On Mon, Apr 17, 2017 at 2:23 PM, MoTao wrote: > Hi
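A hedged sketch of the bloom-filter idea using Spark's built-in sketch class (the RDD of (id, binary) pairs and the sizes are assumptions):

    import org.apache.spark.util.sketch.BloomFilter

    // build one bloom filter per partition and collect them to the driver,
    // so a query can skip partitions that certainly contain none of its ids
    val filters = records.mapPartitionsWithIndex { (idx, iter) =>
      val bf = BloomFilter.create(10000000L, 0.01)   // expected items, false-positive rate
      iter.foreach { case (id, _) => bf.putLong(id) }
      Iterator((idx, bf))
    }.collect().toMap

    // at query time, only scan partitions whose filter might contain the id
    def candidatePartitions(queryId: Long): Seq[Int] =
      filters.collect { case (idx, bf) if bf.mightContainLong(queryId) => idx }.toSeq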

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
I don't think you can insert into a Hive table in parallel without dynamic partitioning; for Hive locking please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work. On Mon, Apr 17, 2017 at 6:52 AM, Amol Patil wrote: > Hi All, > >

Re: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
<mo...@sensetime.com> wrote: > Hi Ryan, > > > 1. "expected qps and response time for the filter request" > > I expect that only the requested BINARY are scanned instead of all > records, so the response time would be "10K * 5MB / disk read speed", or >

Re: 答复: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
Searching around, I find that sequence files might be a comparable option to HAR that you may be interested in. Thanks to everyone involved; I've learnt a lot too :-) On Thu, Apr 20, 2017 at 5:25 PM, 莫涛 <mo...@sensetime.com> wrote: > Hi Ryan, > > > The attachment is the event timeline on executors. Th

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
It shouldn't be a problem then. We've done a similar thing in Scala. I don't have much experience with Python threads, but maybe the code related to reading/writing the temp table isn't thread safe. On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil <amol4soc...@gmail.com> wrote: > Thanks Ryan,

Re: Does Spark SQL uses Calcite?

2017-08-11 Thread Ryan
the thrift server is a jdbc server, Kanth On Fri, Aug 11, 2017 at 2:51 PM, wrote: > I also wonder why there isn't a jdbc connector for spark sql? > > Sent from my iPhone > > On Aug 10, 2017, at 2:45 PM, Jules Damji wrote: > > Yes, it's more used in Hive

Re: How can I tell if a Spark job is successful or not?

2017-08-10 Thread Ryan
You could exit with an error code just like a normal Java/Scala application, and get it from the driver/YARN. On Fri, Aug 11, 2017 at 9:55 AM, Wei Zhang wrote: > I suppose you can find the job status from Yarn UI application view. > > > > Cheers, > > -z > > > > *From:* 陈宇航

Re: How to insert a dataframe as a static partition to a partitioned table

2017-07-19 Thread Ryan
Not sure about the writer API, but you could always register a temp table for that dataframe and execute an insert HQL statement. On Thu, Jul 20, 2017 at 6:13 AM, ctang wrote: > I wonder if there are any easy ways (or APIs) to insert a dataframe (or > DataSet), which does not contain the
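A hedged sketch of that approach (the table, database, and partition value are made up; assumes a SparkSession built with enableHiveSupport, and that the staging view's columns line up with the table's non-partition columns):

    // register the DataFrame, then insert it into a fixed (static) partition
    df.createOrReplaceTempView("staging")
    spark.sql(
      """INSERT INTO TABLE target_db.target_table
        |PARTITION (dt = '2017-07-20')
        |SELECT * FROM staging""".stripMargin)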

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Ryan
I think it creates a new connection on each worker: whenever the Processor references Resource, it gets initialized. There's no need for the driver to connect to the db in this case. On Thu, Jun 29, 2017 at 5:52 PM, salvador wrote: > Hi all, > > I am writing a spark job from
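A hedged sketch of the pattern being described, where a connection held in a Scala object is created lazily on each executor JVM rather than on the driver (the JDBC URL and credentials are placeholders):

    object Resource {
      // initialized lazily, once per JVM, the first time a task touches it
      lazy val connection = java.sql.DriverManager.getConnection(
        "jdbc:postgresql://db-host/mydb", "user", "password")
    }

    rdd.foreachPartition { iter =>
      val conn = Resource.connection   // created on the executor, not the driver
      iter.foreach { row =>
        // use conn to read/write the row here
      }
    }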

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
It's just a sort of inner join operation... If the second dataset isn't very large it's ok (btw, you can use flatMap directly instead of map followed by flatMap/flatten); otherwise you can register the second one as an RDD/Dataset and join them on user id. On Wed, Aug 9, 2017 at 4:29 PM,
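A hedged sketch of both options for the RDD API (the field names, case classes, and RDDs are assumptions):

    // 1) the second dataset is a small driver-side collection: flatMap directly
    val smallItems: Seq[String] = Seq("a", "b", "c")
    val expanded = users.flatMap(u => smallItems.map(item => (u.id, item)))

    // 2) the second dataset is large: key both RDDs by user id and join
    val usersById = users.map(u => (u.id, u))
    val itemsByUser = items.map(i => (i.userId, i))
    val joined = usersById.join(itemsByUser)   // RDD[(userId, (user, item))]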

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
a.com> wrote: > >> Riccardo and Ryan >>Thank you for your ideas.It seems that crossjoin is a new dataset api >> after spark2.x. >> my spark version is 1.6.3. Is there a relative api to do crossjoin? >> thank you. >> >> >> >>

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need to explicitly ask for a forEachPartition in Spark SQL, because Tungsten is smart enough to figure out whether a SQL operation can be applied on each partition or whether a shuffle is needed. On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Ryan
s still germane. > > 2017-06-25 19:18 GMT-07:00 Ryan <ryan.hd@gmail.com>: > >> Why would you like to do so? I think there's no need for us to explicitly >> ask for a forEachPartition in spark sql because tungsten is smart enough to >> figure out whether a

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-25 Thread Ryan
> private static Broadcast bcv; > public static void setBCV(Broadcast setbcv){ bcv = setbcv; } > public static Integer getBCV() > { > return bcv.value(); > } > } > > > On Fri, Jun 16, 2017 at 3:35 AM, Ryan <ryan.hd@gmail.com> wrote: > >>

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-25 Thread Ryan
ng it record by > record. mapPartitions() give us the ability to invoke this in bulk. We're > looking for a similar approach in SQL. > > > -- > *From:* Ryan <ryan.hd@gmail.com> > *Sent:* Sunday, June 25, 2017 7:18:32 PM > *To:* jeff saremi >

Re: Convert the feature vector to raw data

2017-06-07 Thread Ryan
If you use StringIndexer to categorize the data, IndexToString can convert it back. On Wed, Jun 7, 2017 at 6:14 PM, kundan kumar wrote: > Hi Yan, > > This doesnt work. > > thanks, > kundan > > On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai) > wrote: >
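A hedged sketch of the round trip (the DataFrames and column names are assumptions):

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    // index a string column, train/predict on the indexed label, then map back
    val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
    val indexerModel = indexer.fit(df)

    val converter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedCategory")
      .setLabels(indexerModel.labels)          // reuse the fitted label mapping
    val readable = converter.transform(predictions)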

Re: Java SPI jar reload in Spark

2017-06-07 Thread Ryan
I'd suggest scripts like JS, Groovy, etc. To my understanding, the service loader mechanism isn't a good fit for runtime reloading. On Wed, Jun 7, 2017 at 4:55 PM, Jonnas Li(Contractor) < zhongshuang...@envisioncn.com> wrote: > To be more explicit, I used mapwithState() in my application, just

Re: Question about mllib.recommendation.ALS

2017-06-07 Thread Ryan
1. Could you give the job, stage & task status from the Spark UI? I find it extremely useful for performance tuning. 2. Use model.transform for predictions. Usually we have a pipeline for preparing the training data, and using the same pipeline to transform the data you want to predict could give us the
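A hedged sketch of the DataFrame-based ALS workflow with model.transform (column names and the training/test DataFrames are assumptions):

    import org.apache.spark.ml.recommendation.ALS

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")

    val model = als.fit(training)            // fit on the prepared training data
    val predictions = model.transform(test)  // adds a "prediction" column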

Re: good http sync client to be used with spark

2017-06-07 Thread Ryan
We use AsyncHttpClient (from the Java world) and simply call future.get as a synchronous call. On Thu, Jun 1, 2017 at 4:08 AM, vimal dinakaran wrote: > Hi, > In our application pipeline we need to push the data from spark streaming > to a http server. > > I would like to have
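A hedged sketch of that pattern (the package matches AsyncHttpClient 2.x; older releases used com.ning.http.client, and the URL is a placeholder):

    import org.asynchttpclient.Dsl.asyncHttpClient

    val client = asyncHttpClient()
    try {
      // blocking on the future turns the async call into a synchronous one
      val response = client.prepareGet("http://example.com/endpoint").execute().get()
      println(s"${response.getStatusCode} ${response.getResponseBody.take(100)}")
    } finally {
      client.close()
    }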

Re: Worker node log not showed

2017-06-07 Thread Ryan
I think you need to get the logger within the lambda; otherwise it's the logger on the driver side, which won't work. On Wed, May 31, 2017 at 4:48 PM, Paolo Patierno wrote: > No it's running in standalone mode as Docker image on Kubernetes. > > > The only way I found was to
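A hedged sketch of getting the logger inside the closure (log4j is assumed; any logging API with a static factory works the same way):

    import org.apache.log4j.Logger

    rdd.foreachPartition { iter =>
      // obtained on the executor, so it is never serialized from the driver
      val log = Logger.getLogger("worker-side")
      iter.foreach(record => log.info(s"processing $record"))
    }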

Re: No TypeTag Available for String

2017-06-07 Thread Ryan
did you include the proper scala-reflect dependency? On Wed, May 31, 2017 at 1:01 AM, krishmah wrote: > I am currently using Spark 2.0.1 with Scala 2.11.8. However same code works > with Scala 2.10.6. Please advise if I am missing something > > import
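For reference, a hedged sketch of the dependency in sbt syntax (match the version to your Scala compiler; the Maven coordinates are org.scala-lang:scala-reflect):

    // build.sbt
    libraryDependencies += "org.scala-lang" % "scala-reflect" % "2.11.8"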

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-16 Thread Ryan
I don't think Broadcast itself can be serialized. You can get the value out on the driver side and refer to it in the foreach; then the value will be serialized with the lambda expression and sent to the workers. On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko < kravchenko.anto...@gmail.com> wrote: > How

Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread Ryan
I don't think Structured Streaming currently supports a "partitioned" watermark. And why do the different partitions' consumption rates vary? If the handling logic is quite different, using different topics is a better way. On Fri, Sep 1, 2017 at 4:59 PM, 张万新 wrote: > Thanks, it's true that looser

Re: Spark GroupBy Save to different files

2017-09-01 Thread Ryan
you may try foreachPartition On Fri, Sep 1, 2017 at 10:54 PM, asethia wrote: > Hi, > > I have list of person records in following format: > > case class Person(fName:String, city:String) > > val l=List(Person("A","City1"),Person("B","City2"),Person("C","City1")) > > val

All pairs shortest paths?

2014-03-26 Thread Ryan Compton
No idea how feasible this is. Has anyone done it?

Re: All pairs shortest paths?

2014-03-26 Thread Ryan Compton
To clarify: I don't need the actual paths, just the distances. On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton compton.r...@gmail.com wrote: No idea how feasible this is. Has anyone done it?

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with: 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...) On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton compton.r...@gmail.com wrote: Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B

GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like 394365859 -- 136153151 589404147 -- 1361045425 I read it into a Graph via: val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName) val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t")) .map(x

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt This code: println(g.numEdges) println(g.numVertices) println(g.edges.distinct().count()) gave me 1 9294 2 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave ankurd...@gmail.com wrote: I wasn't able to reproduce this

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
, Ryan Compton compton.r...@gmail.com wrote: I'm trying shoehorn a label propagation-ish algorithm into GraphX. I need to update each vertex with the median value of their neighbors. Unlike PageRank, which updates each vertex with the mean of their neighbors, I don't have a simple commutative

Spark 1.0: slf4j version conflicts with pig

2014-05-27 Thread Ryan Compton
I use both Pig and Spark. All my code is built with Maven into a giant *-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now all my pig scripts fail with: Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job:

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
/bidirectional-network-current/part-r-1' USING PigStorage() AS (id1:long, id2:long, weight:int); ttt = LIMIT edgeList0 10; DUMP ttt; On Wed, May 28, 2014 at 12:55 PM, Ryan Compton compton.r...@gmail.com wrote: It appears to be Spark 1.0 related. I made a pom.xml with a single dependency on Spark

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
posted a JIRA https://issues.apache.org/jira/browse/SPARK-1952 On Wed, May 28, 2014 at 1:14 PM, Ryan Compton compton.r...@gmail.com wrote: Remark, just including the jar built by sbt will produce the same error. i,.e this pig script will fail: REGISTER /usr/share/osi1/spark-1.0.0/assembly

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-06 Thread Ryan Compton
Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to

Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Ryan Tabora
4221 2014-07-31 01:01 /tmp/README.md Regards, Ryan Tabora http://ryantabora.com

scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Ryan Williams
I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here https://gist.github.com/ryan-williams/7673d7da928570907f4d is a full gist of an innocuous test command (mvn test -Dsuites

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Ryan Williams
encountering this issue. Typically you would have changed one or more of the profiles/options - which leads to this occurring. 2014-10-22 22:00 GMT-07:00 Ryan Williams ryan.blake.willi...@gmail.com: I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
Fwiw if you do decide to handle language detection on your machine this library works great on tweets https://github.com/carrotsearch/langid-java On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Nov 12, 2014 at 5:42 AM, SK skrishna...@gmail.com wrote: But

Re: Data Loss - Spark streaming

2014-12-16 Thread Ryan Williams
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas gerard.m...@gmail.com wrote: Hi Jeniba, The second part of this meetup recording has a very good answer to your question. TD explains the current behavior and the on-going

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
and have a bunch of ideas about where it should go. Thanks, -Ryan

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great. On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu zsxw...@gmail.com wrote: Ryan - I sent a PR to fix your issue: https://github.com/apache/spark/pull/6599 Edward - I have no idea why the following error happened. ContextCleaner doesn't use any Hadoop API

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690 https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383); attempting to build against Hadoop 1.0.4 yields errors like: 2015-06-02 15:57:44 ERROR Executor:96 -

Spree: a live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
or comments! -Ryan

Re: SequenceFile and object reuse

2015-11-18 Thread Ryan Williams
Hey Jeff, in addition to what Sandy said, there are two more reasons that this might not be as bad as it seems; I may be incorrect in my understanding though. First, the "additional step" you're referring to is not likely to be adding any overhead; the "extra map" is really just materializing the

Re: Appropriate Apache Users List Uses

2016-02-09 Thread Ryan Victory
Yeah, a little disappointed with this, I wouldn't expect to be sent unsolicited mail based on my membership to this list. -Ryan Victory On Tue, Feb 9, 2016 at 1:36 PM, John Omernik <j...@omernik.com> wrote: > All, I received this today, is this appropriate list use? Note: This was >

Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread Ryan Blue
know dictionary of words > if > >> there is no schema provided by user? Where/how to specify my schema / > >> config for Parquet format? > >> > >> Could not find Apache Parquet mailing list in the official site. It > would > >> be great if anyone could share it as well. > >> > >> Regards > >> Ashok > >> > > > > > -- Ryan Blue Software Engineer Netflix

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
progress" > java.lang.OutOfMemoryError: Java heap space at > java.util.Arrays.copyOfRange(Arrays.java:3664) at > java.lang.String.(String.java:207) at > java.lang.StringBuilder.toString(StringBuilder.java:407) at > scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430) > at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55) > at java.util.TimerThread.mainLoop(Timer.java:555) at > java.util.TimerThread.run(Timer.java:505) > > -- Ryan Blue Software Engineer Netflix

Re: Apache Hive with Spark Configuration

2017-01-03 Thread Ryan Blue
astore, can you tell me which > version is more compatible with Spark 2.0.2 ? > > THanks > -- Ryan Blue Software Engineer Netflix

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
> > > Regards > Sumit Chawla > > -- Ryan Blue Software Engineer Netflix

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
for a stage. In that version, you probably want to set spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs <http://spark.apache.org/docs/latest/configuration.html> and search for “blacklist” to see all the options. rb ​ On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue <rb...@netflix.c

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
. >> memoryOverhead. >> >> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8 >> >> Do you think below setting can help me to overcome above issue: >> >> spark.default.parellism=1000 >> spark.sql.shuffle.partitions=1000 >> >> Because default max number of partitions are 1000. >> >> >> > -- Ryan Blue Software Engineer Netflix

Re: testing frameworks

2018-06-12 Thread Ryan Adams
an aggregate baseline. Ryan Ryan Adams radams...@gmail.com On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson wrote: > Hi, > > I wrote this answer to the same question a couple of years ago: > https://www.mail-archive.com/user%40spark.apache.org/msg48032.html > > I

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
kFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349) > > -- Ryan Blue Software Engineer Netflix

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-03 Thread Ryan Blue
or shouldn't > come. Let me know if this understanding is correct > > On Tue, May 1, 2018 at 9:37 PM, Ryan Blue <rb...@netflix.com> wrote: > >> This is usually caused by skew. Sometimes you can work around it by in >> creasing the number of partitions like you tri

Unsubscribe

2018-02-19 Thread Ryan Myer
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

unsubscribe

2018-08-10 Thread Ryan Adams
Ryan Adams radams...@gmail.com

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
elson, Assaf >>> wrote: >>> >>> Could you add a fuller code example? I tried to reproduce it in my >>> environment and I am getting just one instance of the reader… >>> >>> >>> >>> Thanks, >>> >>> Assaf >

unsubscribe

2018-09-20 Thread Ryan Adams
unsubscribe Ryan Adams radams...@gmail.com

Re: Manually reading parquet files.

2019-03-21 Thread Ryan Blue
doopConfWithOptions(relation.options)) > ) > > *import *scala.collection.JavaConverters._ > > *val *rows = readFile(pFile).flatMap(_ *match *{ > *case *r: InternalRow => *Seq*(r) > > // This doesn't work. vector mode is doing something screwy > *case *b: ColumnarBatch => b.rowIterator().asScala > }).toList > > *println*(rows) > //List([0,1,5b,24,66647361]) > //??this is wrong I think > > > > Has anyone attempted something similar? > > > > Cheers Andrew > > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Ryan Blue
get(0, > DataTypes.DateType)); > > } > > It prints an integer as output: > > MyDataWriter.write: 17039 > > > Is this a bug? or I am doing something wrong? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix

Unsubscribe

2019-12-11 Thread Ryan Victory

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
om there. This isn't ideal but it's better than nothing. -Ryan On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho wrote: > I'm also curious if this is possible, so while I can't offer a solution > maybe you could try the following. > > The driver and executor nodes need to have access

Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
em. When I create a spark application JAR and try to run it from my laptop, I get the same problem as #1, namely that it tries to find the warehouse directory on my laptop itself. Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or insights are much appreciated! -Ryan Victory

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Thanks Apostolos, I'm trying to avoid standing up HDFS just for this use case (single node). -Ryan On Wed, Nov 25, 2020 at 8:56 AM Apostolos N. Papadopoulos < papad...@csd.auth.gr> wrote: > Hi Ryan, > > since the driver is at your laptop, in order to access a remote file you &g

Re: Negative Number of Active Tasks in Spark UI

2016-01-05 Thread Shixiong(Ryan) Zhu
Did you enable "spark.speculation"? On Tue, Jan 5, 2016 at 9:14 AM, Prasad Ravilla wrote: > I am using Spark 1.5.2. > > I am not using Dynamic allocation. > > Thanks, > Prasad. > > > > > On 1/5/16, 3:24 AM, "Ted Yu" wrote: > > >Which version of Spark do

Re: Double Counting When Using Accumulators with Spark Streaming

2016-01-05 Thread Shixiong(Ryan) Zhu
Hey Rachana, There are two jobs in your code actually: `rdd.isEmpty` and `rdd.saveAsTextFile`. Since you don't cache or checkpoint this rdd, it will execute your map function twice for each record. You can move "accum.add(1)" to "rdd.saveAsTextFile" like this: JavaDStream lines =

Re: How to register a Tuple3 with KryoSerializer?

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "_", e.g., sparkConf.registerKryoClasses(Array(classOf[scala.Tuple3[_, _, _]])) Best Regards, Shixiong(Ryan) Zhu Software Engineer Databricks Inc. shixi...@databricks.com databricks.com <http://databricks.com/> On Wed, Dec 30, 2015 at 10:16 AM, Russ <russ.br..

Re: 2 of 20,675 Spark Streaming : Out put frequency different from read frequency in StatefulNetworkWordCount

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "reduceByKeyAndWindow", e.g., val lines = ssc.socketTextStream("localhost", ) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(60), Seconds(60)) wordCounts.print() On Wed, Dec

Re: Kryo serializer Exception during serialization: java.io.IOException: java.lang.IllegalArgumentException:

2016-01-08 Thread Shixiong(Ryan) Zhu
Could you disable `spark.kryo.registrationRequired`? Some classes may not be registered but they work well with Kryo's default serializer. On Fri, Jan 8, 2016 at 8:58 AM, Ted Yu wrote: > bq. try adding scala.collection.mutable.WrappedArray > > But the hint said registering

Re: Too many tasks killed the scheduler

2016-01-11 Thread Shixiong(Ryan) Zhu
Could you use "coalesce" to reduce the number of partitions? Shixiong Zhu On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue wrote: > Here is more info. > > The job stuck at: > INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks > > Then got the error: > Caused

Re: [Spark 1.6][Streaming] About the behavior of mapWithState

2016-01-15 Thread Shixiong(Ryan) Zhu
Hey Terry, That's expected. If you want to only output (1, 3), you can use "reduceByKey" before "mapWithState" like this: dstream.reduceByKey(_ + _).mapWithState(spec) On Fri, Jan 15, 2016 at 1:21 AM, Terry Hoo wrote: > Hi, > I am doing a simple test with mapWithState,

Re: 答复: 答复: 答复: spark streaming context trigger invoke stop why?

2016-01-15 Thread Shixiong(Ryan) Zhu
gt; allKafkaWindowData = > this.sparkReceiverDStream.createReceiverDStream(jsc,this.streamingConf.getWindowDuration(), > > this.streamingConf.getSlideDuration()); > > > > this.businessProcess(allKafkaWindowData); > > this.sleep(); > >jsc.start(); &

Re: Calling SparkContext methods in scala Future

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey Marco, Since the codes in Future is in an asynchronous way, you cannot call "sparkContext.stop" at the end of "fetch" because the codes in Future may not finish. However, the exception seems weird. Do you have a simple reproducer? On Mon, Jan 18, 2016 at 9:13 AM, Ted Yu

Re: Spark Streaming: Does mapWithState implicitly partition the dsteram?

2016-01-18 Thread Shixiong(Ryan) Zhu
mapWithState uses HashPartitioner by default. You can use "StateSpec.partitioner" to set your custom partitioner. On Sun, Jan 17, 2016 at 11:00 AM, Lin Zhao wrote: > When the state is passed to the task that handles a mapWithState for a > particular key, if the key is
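A hedged sketch of attaching a custom partitioner to the state spec (the mapping function, partition count, and pairDStream are made up; HashPartitioner here could be any custom Partitioner implementation):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.{State, StateSpec}

    def mappingFunc(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
      val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
      state.update(sum)                        // keep a running sum per key
      Some((key, sum))
    }

    val spec = StateSpec.function(mappingFunc _).partitioner(new HashPartitioner(32))
    val stateDStream = pairDStream.mapWithState(spec)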

Re: Incorrect timeline for Scheduling Delay in Streaming page in web UI?

2016-01-18 Thread Shixiong(Ryan) Zhu
Hey, did you mean that the scheduling delay timeline is incorrect because it's too short and some values are missing? A batch won't have a scheduling delay until it starts to run. In your example, a lot of batches are waiting so that they don't have the scheduling delay. On Sun, Jan 17, 2016 at

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you change MEMORY_ONLY_SER to MEMORY_AND_DISK_SER_2 and see if this still happens? It may be because you don't have enough memory to cache the events. On Thu, Jan 14, 2016 at 4:06 PM, Lin Zhao wrote: > Hi, > > I'm testing spark streaming with actor receiver. The actor

Re: How to bind webui to localhost?

2016-01-14 Thread Shixiong(Ryan) Zhu
Yeah, it's hard code as "0.0.0.0". Could you send a PR to add a configuration for it? On Thu, Jan 14, 2016 at 2:51 PM, Zee Chen wrote: > Hi, what is the easiest way to configure the Spark webui to bind to > localhost or 127.0.0.1? I intend to use this with ssh socks proxy to >

Re: Spark Streaming: custom actor receiver losing vast majority of data

2016-01-14 Thread Shixiong(Ryan) Zhu
:31:31 INFO storage.BlockManager: Removing RDD 44 > 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 43 > 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 42 > 16/01/15 00:31:31 INFO storage.BlockManager: Removing RDD 41 > 16/01/15 00:31:31 INFO storage.BlockMana

Re: NPE when using Joda DateTime

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you try to use "Kryo.setDefaultSerializer" like this: class YourKryoRegistrator extends KryoRegistrator { override def registerClasses(kryo: Kryo) { kryo.setDefaultSerializer(classOf[com.esotericsoftware.kryo.serializers.JavaSerializer]) } } On Thu, Jan 14, 2016 at 12:54 PM, Durgesh

Re: 答复: 答复: spark streaming context trigger invoke stop why?

2016-01-14 Thread Shixiong(Ryan) Zhu
Could you show your codes? Did you use `StreamingContext.awaitTermination`? If so, it will return if any exception happens. On Wed, Jan 13, 2016 at 11:47 PM, Triones,Deng(vip.com) < triones.d...@vipshop.com> wrote: > What’s more, I am running a 7*24 hours job , so I won’t call System.exit() > by

Re: 答复: 答复: 答复: 答复: spark streaming context trigger invoke stop why?

2016-01-18 Thread Shixiong(Ryan) Zhu
oint(RDD.scala:300) > > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) > > at org.a

Re: Spark Streaming : Limiting number of receivers per executor

2016-02-10 Thread Shixiong(Ryan) Zhu
You can't. The number of cores must be greater than the number of receivers. On Wed, Feb 10, 2016 at 2:34 AM, ajay garg wrote: > Hi All, > I am running 3 executors in my spark streaming application with 3 > cores per executors. I have written my custom receiver

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Shixiong(Ryan) Zhu
tion.Iterator$$anon$ > org.apache.spark.InterruptibleIterator# > scala.collection.IndexedSeqLike$Elements# > scala.collection.mutable.ArrayOps$ofRef# > java.lang.Object[]# > > > > > On Thu, Feb 4, 2016 at 7:14 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrot

Re: Spark Streaming - 1.6.0: mapWithState Kinesis huge memory usage

2016-02-04 Thread Shixiong(Ryan) Zhu
Hey Udo, mapWithState usually uses much more memory than updateStateByKey since it caches the states in memory. However, from your description, it looks like BlockGenerator cannot push data into BlockManager; there may be something wrong in BlockGenerator. Could you share the top 50 objects in the heap

Re: [Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-09 Thread Shixiong(Ryan) Zhu
Could you do a thread dump in the executor that runs the Kinesis receiver and post it? It would be great if you can provide the executor log as well? On Tue, Feb 9, 2016 at 3:14 PM, Roberto Coluccio wrote: > Hello, > > can anybody kindly help me out a little bit

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Are you using a custom input dstream? If so, you can make the `compute` method return None to skip a batch. On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu wrote: > I was wondering if there is there any way to skip batches with zero events > when streaming? > By skip I
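A hedged sketch of such a custom input dstream (the data source read is a hypothetical placeholder):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{StreamingContext, Time}
    import org.apache.spark.streaming.dstream.InputDStream

    class SkippingInputDStream(streamingContext: StreamingContext)
        extends InputDStream[String](streamingContext) {

      override def start(): Unit = {}
      override def stop(): Unit = {}

      override def compute(validTime: Time): Option[RDD[String]] = {
        val data = fetchBatch(validTime)                 // hypothetical source read
        if (data.isEmpty) None                           // returning None skips the batch
        else Some(streamingContext.sparkContext.parallelize(data))
      }

      private def fetchBatch(t: Time): Seq[String] = Seq.empty  // placeholder
    }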

Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
is behaviour > > Thanks! > On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" <shixi...@databricks.com> > wrote: > >> Are you using a custom input dstream? If so, you can make the `compute` >> method return None to skip a batch. >> >> On Thu, Feb 11

Re: PairDStreamFunctions.mapWithState fails in case timeout is set without updating State[S]

2016-02-04 Thread Shixiong(Ryan) Zhu
Thanks for reporting it. I will take a look. On Thu, Feb 4, 2016 at 6:56 AM, Yuval.Itzchakov wrote: > Hi, > I've been playing with the expiramental PairDStreamFunctions.mapWithState > feature and I've seem to have stumbled across a bug, and was wondering if > anyone else has

Re: Spark Streaming from existing RDD

2016-01-29 Thread Shixiong(Ryan) Zhu
Do you just want to write some unit tests? If so, you can use "queueStream" to create a DStream from a queue of RDDs. However, because it doesn't support metadata checkpointing, it's better to only use it in unit tests. On Fri, Jan 29, 2016 at 7:35 AM, Sateesh Karuturi <
