Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
om there. This isn't ideal but it's better than nothing. -Ryan On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho wrote: > I'm also curious if this is possible, so while I can't offer a solution > maybe you could try the following. > > The driver and executor nodes need to have access

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Thanks Apostolos, I'm trying to avoid standing up HDFS just for this use case (single node). -Ryan On Wed, Nov 25, 2020 at 8:56 AM Apostolos N. Papadopoulos < papad...@csd.auth.gr> wrote: > Hi Ryan, > > since the driver is at your laptop, in order to access a remote file you

Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
em. When I create a spark application JAR and try to run it from my laptop, I get the same problem as #1, namely that it tries to find the warehouse directory on my laptop itself. Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or insights are much appreciated! -Ryan Victory

Re: Left Join at SQL query gets planned as inner join

2020-04-30 Thread Ryan C. Kleck
and the planner recognizes that. Ryan Kleck Software Developer IV Customer Knowledge Platform From: Roland Johann Sent: Thursday, April 30, 2020 8:30:05 AM To: randy clinton Cc: Roland Johann ; user Subject: Re: Left Join at SQL query gets planned as inner join

Re: [Structured Streaming] NullPointerException in long running query

2020-04-29 Thread Shixiong(Ryan) Zhu
The stack trace is omitted by JVM when an exception is thrown too many times. This usually happens when you have multiple Spark tasks on the same executor JVM throwing the same exception. See https://stackoverflow.com/a/3010106 Best Regards, Ryan On Tue, Apr 28, 2020 at 10:45 PM lec ssmi wrote

Re: Spark 2.3 and Kafka client library version

2020-04-28 Thread Shixiong(Ryan) Zhu
You should be able to override the Kafka client version. The Kafka APIs used by Structured Streaming exist in new Kafka versions. There is a known correctness issue <https://issues.apache.org/jira/browse/KAFKA-4547> in Kafka 0.10.1.*. Other versions should be fine. Best Regards, Ryan
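A minimal sbt sketch of such an override, assuming an sbt build; the exact versions below are illustrative, not taken from the thread:

    // build.sbt -- pin kafka-clients to a newer version than the one
    // spark-sql-kafka-0-10 pulls in transitively
    libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
    dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.11.0.2"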

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-31 Thread Shixiong(Ryan) Zhu
The reason for this is that Spark RPC and the persisted states of HA mode both use Java serialization to serialize internal classes, which don't have any compatibility guarantee. Best Regards, Ryan On Fri, Jan 31, 2020 at 9:08 AM Shixiong(Ryan) Zhu wrote: > Unfortunately, Spark standalone m

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-31 Thread Shixiong(Ryan) Zhu
versions. Best Regards, Ryan On Wed, Jan 29, 2020 at 2:12 AM bsikander wrote: > Anyone? > This question is not regarding my application running on top of Spark. > The question is about the upgrade of spark itself from 2.2 to 2.4. > > I expected at least that spark would recove

Unsubscribe

2019-12-11 Thread Ryan Victory

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
I recommend you use Structured Streaming, as it has a patch that can work around this issue: https://issues.apache.org/jira/browse/SPARK-26267 Best Regards, Ryan On Tue, Apr 30, 2019 at 3:34 PM Shixiong(Ryan) Zhu wrote: > There is a known issue that Kafka may return a wrong offset e

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
There is a known issue that Kafka may return a wrong offset even if there is no reset happening: https://issues.apache.org/jira/browse/KAFKA-7703 Best Regards, Ryan On Tue, Apr 30, 2019 at 10:41 AM Austin Weaver wrote: > @deng - There was a short erroneous period where 2 streams were read

Re: Spark streaming error - Query terminated with exception: assertion failed: Invalid batch: a#660,b#661L,c#662,d#663,,… 26 more fields != b#1291L

2019-04-01 Thread Shixiong(Ryan) Zhu
t;)) > .outputMode(OutputMode.Append()) > .start() > > streamingQuery.awaitTermination() > > if I remove the select (i.e. val df1 = df.filter(df("a") >= "-1")), it > works fine. > > Any idea why? > -- Best Regards, Ryan

Re: Spark Kafka Batch Write guarantees

2019-04-01 Thread Shixiong(Ryan) Zhu
> Does spark provide similar guarantee like it provides with saving > dataframe to disk; that partial data is not written to Kafka i.e. full > dataframe is saved or if job fails no data is written to Kafka topic. > > Thanks. > -- Best Regards, Ryan

Re: Manually reading parquet files.

2019-03-21 Thread Ryan Blue
doopConfWithOptions(relation.options)) > ) > > import scala.collection.JavaConverters._ > > val rows = readFile(pFile).flatMap(_ match { > case r: InternalRow => Seq(r) > > // This doesn't work. vector mode is doing something screwy > case b: ColumnarBatch => b.rowIterator().asScala > }).toList > > println(rows) > //List([0,1,5b,24,66647361]) > //??this is wrong I think > > > > Has anyone attempted something similar? > > > > Cheers Andrew > > > -- Ryan Blue Software Engineer Netflix

Re: DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Ryan Blue
get(0, > DataTypes.DateType)); > > } > > It prints an integer as output: > > MyDataWriter.write: 17039 > > > Is this a bug? or I am doing something wrong? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix

Re: Structured streaming from Kafka by timestamp

2019-01-24 Thread Shixiong(Ryan) Zhu
options when loading from Kafka. Best Regards, Ryan On Thu, Jan 24, 2019 at 10:15 AM Gabor Somogyi wrote: > Hi Tomas, > > As a general note, I don't fully understand your use-case. You've mentioned > structured streaming but your query is more like a one-time SQL statement. > Kaf

Re: Spark Streaming join taking long to process

2018-11-27 Thread Shixiong(Ryan) Zhu
If you are using the same code to run on YARN, I believe it's still using local mode, because the master URL set in code overwrites the one set by the CLI. You can check the “executors” tab in the Spark UI to see how many executors are running, and verify whether it matches your config. On Tue, Nov 27, 2018 at 6:17 AM
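For reference, a sketch of the usual cause (app name here is hypothetical): a master hardcoded in code wins over the one passed on the command line.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("streaming-join")
      .master("local[*]")  // remove this line so --master from spark-submit takes effect
      .getOrCreate()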

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
elson, Assaf >>> wrote: >>> >>> Could you add a fuller code example? I tried to reproduce it in my >>> environment and I am getting just one instance of the reader… >>> >>> >>> >>> Thanks, >>> >>> Assaf >

Re: Recreate Dataset from list of Row in spark streaming application.

2018-10-05 Thread Shixiong(Ryan) Zhu
oracle.insight.spark.identifier_processor.ForeachFederatedEventProcessor is a ForeachWriter. Right? You cannot use SparkSession in its process method, as it will run on executors. Best Regards, Ryan On Fri, Oct 5, 2018 at 6:54 AM Kuttaiah Robin wrote: > Hello, > > I have a spark

Re: PySpark structured streaming job throws socket exception

2018-10-04 Thread Shixiong(Ryan) Zhu
As far as I know, the error log in updateAccumulators will not fail a Spark task. Did you see other error messages? Best Regards, Ryan On Thu, Oct 4, 2018 at 2:14 PM mmuru wrote: > Hi, > > Running Pyspark structured streaming job on K8S with 2 executor pods. The > drive

Re: Kafka Connector version support

2018-09-21 Thread Shixiong(Ryan) Zhu
-dev +user We don't backport new features to a maintenance branch. All new updates will be just in 2.4. Best Regards, Ryan On Fri, Sep 21, 2018 at 2:44 PM, Basil Hariri < basil.har...@microsoft.com.invalid> wrote: > Hi all, > > > > Are there any plans to backport the

unsubscribe

2018-09-20 Thread Ryan Adams
unsubscribe Ryan Adams radams...@gmail.com

unsubscribe

2018-08-10 Thread Ryan Adams
Ryan Adams radams...@gmail.com

Re: Error while doing stream-stream inner join (java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access)

2018-07-03 Thread Shixiong(Ryan) Zhu
Which version are you using? There is a known issue regarding this and should be fixed in 2.3.1. See https://issues.apache.org/jira/browse/SPARK-23623 for details. Best Regards, Ryan On Mon, Jul 2, 2018 at 3:56 AM, kant kodali wrote: > Hi All, > > I get the below error quite often

Re: testing frameworks

2018-06-12 Thread Ryan Adams
an aggregate baseline. Ryan Ryan Adams radams...@gmail.com On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson wrote: > Hi, > > I wrote this answer to the same question a couple of years ago: > https://www.mail-archive.com/user%40spark.apache.org/msg48032.html > > I

Re: [structured-streaming][kafka] Will the Kafka readstream timeout after connections.max.idle.ms 540000 ms ?

2018-05-16 Thread Shixiong(Ryan) Zhu
The streaming query should keep polling data from Kafka. When the query was stopped, did you see any exception? Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com databricks.com [image: http://databricks.com]

Re: Continuous Processing mode behaves differently from Batch mode

2018-05-16 Thread Shixiong(Ryan) Zhu
One possible case is you don't have enough resources to launch all tasks for your continuous processing query. Could you check the Spark UI and see if all tasks are running rather than waiting for resources? Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com databricks.com

Re: I cannot use spark 2.3.0 and kafka 0.9?

2018-05-08 Thread Shixiong(Ryan) Zhu
"note that the 0.8 integration is compatible with later 0.9 and 0.10 brokers, but the 0.10 integration is not compatible with earlier brokers." This is pretty clear. You can use 0.8 integration to talk to 0.9 broker. Best Regards, Shixiong Zhu Databricks Inc. shixi...@databricks.com

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-03 Thread Ryan Blue
or shouldn't > come. Let me know if this understanding is correct > > On Tue, May 1, 2018 at 9:37 PM, Ryan Blue <rb...@netflix.com> wrote: > >> This is usually caused by skew. Sometimes you can work around it by >> increasing the number of partitions like you tri

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
kFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:419) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349) > > -- Ryan Blue Software Engineer Netflix

Issues with large schema tables

2018-03-07 Thread Ballas, Ryan W
Hello All, Our team is having a lot of issues with the Spark API, particularly with large schema tables. We currently have a program written in Scala that utilizes the Apache Spark API to create two tables from raw files. We have one particularly large raw data file that contains around

Unsubscribe

2018-02-19 Thread Ryan Myer
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Structured Streaming + Kafka - Corrupted Checkpoint Offsets / Commits

2018-01-04 Thread Shixiong(Ryan) Zhu
The root cause is probably that HDFSMetadataLog ignores exceptions thrown by "output.close". I think this should be fixed by this line in Spark 2.2.1 and 3.0.0: https://github.com/apache/spark/commit/6edfff055caea81dc3a98a6b4081313a0c0b0729#diff-aaeb546880508bb771df502318c40a99L126 Could you try

Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-28 Thread Shixiong(Ryan) Zhu
The cluster mode doesn't upload jars to the driver node. This is a known issue: https://issues.apache.org/jira/browse/SPARK-4160 On Wed, Dec 27, 2017 at 1:27 AM, Geoff Von Allmen wrote: > I’ve tried it both ways. > > Uber jar gives me the following: > >-

Re: What does Blockchain technology mean for Big Data? And how Hadoop/Spark will play role with it?

2017-12-19 Thread Ryan C. Kleck
be able to replace YARN at some point but won’t be able to replace HDFS or MR or spark. Regards, Ryan Kleck On Dec 19, 2017, at 7:29 AM, Vadim Semenov <vadim.seme...@datadoghq.com<mailto:vadim.seme...@datadoghq.com>> wrote: I think it means that we can replace HDFS with a blockch

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Shixiong(Ryan) Zhu
You are using the Spark Streaming Kafka package. The correct package name is "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0" On Mon, Nov 20, 2017 at 4:15 PM, salemi wrote: > Yes, we are using --packages > > $SPARK_HOME/bin/spark-submit --packages >
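For reference, a sketch of a submit command with the corrected package; the application file name is hypothetical:

    $SPARK_HOME/bin/spark-submit \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 \
      my_streaming_app.py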

Re: How to print plan of Structured Streaming DataFrame

2017-11-20 Thread Shixiong(Ryan) Zhu
-dev +user Which Spark version are you using? There is a bug in the old Spark. Try to use the latest version. In addition, you can call `query.explain()` as well. On Mon, Nov 20, 2017 at 4:00 AM, Chang Chen wrote: > Hi Guys > > I modified StructuredNetworkWordCount to

Re: spark-stream memory table global?

2017-11-10 Thread Shixiong(Ryan) Zhu
It must be accessed under the same SparkSession. We can also add an option to make it a global temp view. Feel free to open a PR to improve it. On Fri, Nov 10, 2017 at 4:56 AM, Imran Rajjad wrote: > Hi, > > Does the memory table in which spark-structured streaming results

Re: Structured Stream in Spark

2017-10-27 Thread Shixiong(Ryan) Zhu
The code in the link writes the data into files. Did you check the output location? By the way, if you want to see the data on the console, you can use the console sink by changing this line format("parquet").option("path", outputPath + "/ETL").partitionBy("creationTime").start() to
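A minimal sketch of the console-sink variant described above:

    df.writeStream
      .format("console")     // print each batch to stdout instead of writing parquet files
      .outputMode("append")
      .start()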

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Shixiong(Ryan) Zhu
It's because "toJSON" doesn't support Structured Streaming. The current implementation will convert the Dataset to an RDD, which is not supported by streaming queries. On Sat, Sep 9, 2017 at 4:40 PM, kant kodali wrote: > yes it is a streaming dataset. so what is the problem

Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread Ryan
I don't think SS supports a "partitioned" watermark yet. And why do different partitions' consumption rates vary? If the handling logic is quite different, using different topics is a better way. On Fri, Sep 1, 2017 at 4:59 PM, 张万新 wrote: > Thanks, it's true that looser

Re: Spark GroupBy Save to different files

2017-09-01 Thread Ryan
you may try foreachPartition On Fri, Sep 1, 2017 at 10:54 PM, asethia wrote: > Hi, > > I have list of person records in following format: > > case class Person(fName:String, city:String) > > val l=List(Person("A","City1"),Person("B","City2"),Person("C","City1")) > > val

Re: [Structured Streaming]Data processing and output trigger should be decoupled

2017-08-30 Thread Shixiong(Ryan) Zhu
I don't think that's a good idea. If the engine keeps on processing data but doesn't output anything, where would it keep the intermediate data? On Wed, Aug 30, 2017 at 9:26 AM, KevinZwx wrote: > Hi, > > I'm working with structured streaming, and I'm wondering whether there >

Re: PySpark, Structured Streaming and Kafka

2017-08-23 Thread Shixiong(Ryan) Zhu
You can use `bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0` to start "pyspark". If you want to use "spark-submit", you also need to provide your Python file. On Wed, Aug 23, 2017 at 1:41 PM, Brian Wylie wrote: > Hi All, > > I'm trying the new

Re: [StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-12 Thread Shixiong(Ryan) Zhu
Spark creates one connection for each query. The behavior you observed is because of how "nc -lk" works. If you use `netstat` to check the tcp connections, you will see there are two connections when starting two queries. However, "nc" forwards the input to only one connection. On Fri, Aug 11, 2017

Re: Does Spark SQL uses Calcite?

2017-08-11 Thread Ryan
the thrift server is a jdbc server, Kanth On Fri, Aug 11, 2017 at 2:51 PM, wrote: > I also wonder why there isn't a jdbc connector for spark sql? > > Sent from my iPhone > > On Aug 10, 2017, at 2:45 PM, Jules Damji wrote: > > Yes, it's more used in Hive

Re: How can I tell if a Spark job is successful or not?

2017-08-10 Thread Ryan
you could exit with an error code just like a normal Java/Scala application, and get it from the driver/YARN On Fri, Aug 11, 2017 at 9:55 AM, Wei Zhang wrote: > I suppose you can find the job status from the YARN UI application view. > > > > Cheers, > > -z > > > > From: 陈宇航
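A minimal sketch of that approach; runJob is a hypothetical entry point standing in for the actual work:

    try {
      runJob(spark)  // hypothetical: runs the actual job logic
      sys.exit(0)
    } catch {
      case e: Exception =>
        e.printStackTrace()
        sys.exit(1)  // a non-zero exit code surfaces as a failed application in YARN
    }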

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
a.com> wrote: > >> Riccardo and Ryan >>Thank you for your ideas.It seems that crossjoin is a new dataset api >> after spark2.x. >> my spark version is 1.6.3. Is there a relative api to do crossjoin? >> thank you. >> >> >> >>

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
It's just a sort of inner join operation... If the second dataset isn't very large it's ok (btw, you can use flatMap directly instead of map followed by flatMap/flatten), otherwise you can register the second one as an RDD/dataset, and join them on user id. On Wed, Aug 9, 2017 at 4:29 PM,

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
. >> memoryOverhead. >> >> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8 >> >> Do you think below setting can help me to overcome above issue: >> >> spark.default.parellism=1000 >> spark.sql.shuffle.partitions=1000 >> >> Because default max number of partitions are 1000. >> >> >> > -- Ryan Blue Software Engineer Netflix

Re: How to insert a dataframe as a static partition to a partitioned table

2017-07-19 Thread Ryan
Not sure about the writer api, but you could always register a temp table for that dataframe and execute insert hql. On Thu, Jul 20, 2017 at 6:13 AM, ctang wrote: > I wonder if there are any easy ways (or APIs) to insert a dataframe (or > DataSet), which does not contain the
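A sketch of the temp-table route, assuming Hive support is enabled and with hypothetical table and partition names:

    // stage the dataframe, then insert it into a single static partition via SQL
    df.createOrReplaceTempView("staging")
    spark.sql(
      """INSERT OVERWRITE TABLE target_table PARTITION (dt = '2017-07-19')
        |SELECT * FROM staging""".stripMargin)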

Re: spark streaming socket read issue

2017-06-30 Thread Shixiong(Ryan) Zhu
Could you show the code that starts the StreamingQuery from the Dataset? If you don't call `writeStream.start(...)`, it won't run anything. On Fri, Jun 30, 2017 at 6:47 AM, pradeepbill wrote: > hi there, I have a spark streaming issue that i am not able to figure out,

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Ryan
I think it creates a new connection on each worker: whenever the Processor references Resource, it gets initialized. There's no need for the driver to connect to the db in this case. On Thu, Jun 29, 2017 at 5:52 PM, salvador wrote: > Hi all, > > I am writing a spark job from

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread Shixiong(Ryan) Zhu
"--package" will add transitive dependencies that are not "$SPARK_HOME/external/kafka-0-10-sql/target/*.jar". > i have tried building the jar with dependencies, but still face the same error. What's the command you used? On Wed, Jun 28, 2017 at 12:00 PM, satyajit vegesna <

Re: ZeroMQ Streaming in Spark2.x

2017-06-26 Thread Shixiong(Ryan) Zhu
It's moved to http://bahir.apache.org/ You can find document there. On Mon, Jun 26, 2017 at 11:58 AM, Aashish Chaudhary < aashish.chaudh...@kitware.com> wrote: > Hi there, > > I am a beginner when it comes to Spark streaming. I was looking for some > examples related to ZeroMQ and Spark and

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
ng it record by > record. mapPartitions() give us the ability to invoke this in bulk. We're > looking for a similar approach in SQL. > > > -- > *From:* Ryan <ryan.hd@gmail.com> > *Sent:* Sunday, June 25, 2017 7:18:32 PM > *To:* jeff saremi >

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-25 Thread Ryan
> private static Broadcast<Integer> bcv; > public static void setBCV(Broadcast<Integer> setbcv){ bcv = setbcv; } > public static Integer getBCV() > { > return bcv.value(); > } > } > > > On Fri, Jun 16, 2017 at 3:35 AM, Ryan <ryan.hd@gmail.com> wrote: > >>

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
s still germane. > > 2017-06-25 19:18 GMT-07:00 Ryan <ryan.hd@gmail.com>: > >> Why would you like to do so? I think there's no need for us to explicitly >> ask for a forEachPartition in spark sql because tungsten is smart enough to >> figure out whether a

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need for us to explicitly ask for a forEachPartition in spark sql, because tungsten is smart enough to figure out whether a sql operation can be applied on each partition or whether there has to be a shuffle. On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi

Re: Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-19 Thread Shixiong(Ryan) Zhu
Some of the projects (such as spark-tags) are Java projects. Spark doesn't vary the artifact name by Scala version and just hard-codes 2.11. For your issue, try to use `install` rather than `package`. On Sat, Jun 17, 2017 at 7:20 PM, Kanagha Kumar wrote: > Hi, > > Bumping up again! Why does

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-16 Thread Ryan
I don't think Broadcast itself can be serialized. you can get the value out on the driver side and refer to it in foreach, then the value would be serialized with the lambda expr and sent to workers. On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko < kravchenko.anto...@gmail.com> wrote: > How
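A sketch of the pattern described, with a hypothetical lookup map as the broadcast payload:

    val lookup = Map("a" -> 1, "b" -> 2)  // hypothetical data
    val bc = sc.broadcast(lookup)
    val v = bc.value  // read the value out on the driver, as suggested above
    rdd.foreachPartition { it =>
      it.foreach(x => println(v.getOrElse(x, 0)))  // v is serialized with the closure
    }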

Re: Is Structured streaming ready for production usage

2017-06-08 Thread Shixiong(Ryan) Zhu
Please take a look at http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy wrote: > OK. Can we use Spark Kafka Direct with Structured Streaming? > > On Thu, Jun 8, 2017 at 4:46 PM, swetha

Re: No TypeTag Available for String

2017-06-07 Thread Ryan
did you include the proper scala-reflect dependency? On Wed, May 31, 2017 at 1:01 AM, krishmah wrote: > I am currently using Spark 2.0.1 with Scala 2.11.8. However same code works > with Scala 2.10.6. Please advise if I am missing something > > import

Re: Worker node log not showed

2017-06-07 Thread Ryan
I think you need to get the logger within the lambda; otherwise it's the logger on the driver side, which can't work. On Wed, May 31, 2017 at 4:48 PM, Paolo Patierno wrote: > No it's running in standalone mode as Docker image on Kubernetes. > > > The only way I found was to
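A sketch of getting the logger inside the closure, so it is created on each executor instead of being serialized from the driver:

    rdd.foreachPartition { rows =>
      // loggers are not serializable, so obtain one on the worker side
      val log = org.slf4j.LoggerFactory.getLogger("worker-side")
      rows.foreach(r => log.info(s"processing $r"))
    }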

Re: good http sync client to be used with spark

2017-06-07 Thread Ryan
we use AsyncHttpClient (from the java world) and simply call future.get as a synchronous call. On Thu, Jun 1, 2017 at 4:08 AM, vimal dinakaran wrote: > Hi, > In our application pipeline we need to push the data from spark streaming > to a http server. > > I would like to have

Re: Question about mllib.recommendation.ALS

2017-06-07 Thread Ryan
1. could you give the job, stage & task status from the Spark UI? I found it extremely useful for performance tuning. 2. use model.transform for predictions. Usually we have a pipeline for preparing training data, and using the same pipeline to transform the data you want to predict could give us the

Re: Java SPI jar reload in Spark

2017-06-07 Thread Ryan
I'd suggest scripts like js, groovy, etc.. To my understanding the service loader mechanism isn't a good fit for runtime reloading. On Wed, Jun 7, 2017 at 4:55 PM, Jonnas Li(Contractor) < zhongshuang...@envisioncn.com> wrote: > To be more explicit, I used mapwithState() in my application, just

Re: Convert the feature vector to raw data

2017-06-07 Thread Ryan
if you use StringIndexer to category the data, IndexToString could convert it back. On Wed, Jun 7, 2017 at 6:14 PM, kundan kumar wrote: > Hi Yan, > > This doesnt work. > > thanks, > kundan > > On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai) > wrote: >
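A sketch of the round trip, with hypothetical column names:

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val model = indexer.fit(df)
    val indexed = model.transform(df)

    // convert the indices back to the original string labels
    val converter = new IndexToString()
      .setInputCol("categoryIndex").setOutputCol("categoryOrig")
      .setLabels(model.labels)
    val restored = converter.transform(indexed)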

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
I don't know what happened in your case so cannot provide any work around. It would be great if you can provide logs output by HDFSBackedStateStoreProvider. On Thu, May 25, 2017 at 4:05 PM, kant kodali <kanth...@gmail.com> wrote: > > On Thu, May 25, 2017 at 3:41 PM, Shixiong(Ryan) Z

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
com> wrote: > Should I file a ticket or should I try another version like Spark 2.2 > since I am currently using 2.1.1? > > On Thu, May 25, 2017 at 2:38 PM, kant kodali <kanth...@gmail.com> wrote: > >> Hi Ryan, >> >> You are right I was setting check

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
ion variable >> like HADOOP_CONF_DIR ? I am currently not setting that in >> conf/spark-env.sh and thats the only hadoop related environment variable I >> see. please let me know >> >> thanks! >> >> >> >> On Thu, May 25, 2017 at 1:19 AM, kant kodali <kanth

Re: Running into the same problem as JIRA SPARK-19268

2017-05-25 Thread Shixiong(Ryan) Zhu
rver.java:2045) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) > at org.apache.hadoop.ipc.Server$Handler.r

Re: Running into the same problem as JIRA SPARK-19268

2017-05-24 Thread Shixiong(Ryan) Zhu
What's the value of "hdfsCheckPointDir"? Could you list this directory on HDFS and report the files there? On Wed, May 24, 2017 at 3:50 PM, Michael Armbrust wrote: > -dev > > Have you tried clearing out the checkpoint directory? Can you also give > the full stack trace?

Re: How to print data to console in structured streaming using Spark 2.1.0?

2017-05-16 Thread Shixiong(Ryan) Zhu
The default "startingOffsets" is "latest". If you don't push any data after starting the query, it won't fetch anything. You can set it to "earliest" like ".option("startingOffsets", "earliest")" to start the stream from the beginning. On Tue, May 16, 2017 at 12:36 AM, kant kodali

Re: Application dies, Driver keeps on running

2017-05-15 Thread Shixiong(Ryan) Zhu
So you are using `client` mode. Right? If so, Spark cluster doesn't manage the driver for you. Did you see any error logs in driver? On Mon, May 15, 2017 at 3:01 PM, map reduced wrote: > Hi, > > Setup: Standalone cluster with 32 workers, 1 master > I am running a long

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Shixiong(Ryan) Zhu
This is because RDD.union doesn't check the schema, so you won't see the problem unless you run the RDD and hit the incompatible column. For RDDs, you may not see any error if you don't use the incompatible column. Dataset.union requires a compatible schema. You can print ds.schema and
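A quick way to compare the two schemas before the union, with hypothetical column names; note that Dataset.union resolves columns by position, so order and types must line up:

    ds1.printSchema()
    ds2.printSchema()
    // select columns explicitly if the schemas differ
    val unioned = ds1.select("id", "value").union(ds2.select("id", "value"))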

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
for a stage. In that version, you probably want to set spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs <http://spark.apache.org/docs/latest/configuration.html> and search for “blacklist” to see all the options. rb ​ On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue <rb...@netflix.c

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
> > > Regards > Sumit Chawla > > -- Ryan Blue Software Engineer Netflix

Re: 答复: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
searching, I find that a sequence file might be a comparable alternative to HAR you may be interested in. Thanks to all the people involved. I've learnt a lot too :-) On Thu, Apr 20, 2017 at 5:25 PM, 莫涛 <mo...@sensetime.com> wrote: > Hi Ryan, > > > The attachment is the event timeline on executors. Th

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
It shouldn't be a problem then. We've done a similar thing in Scala. I don't have much experience with Python threads, but maybe the code related to reading/writing the temp table isn't thread safe. On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil <amol4soc...@gmail.com> wrote: > Thanks Ryan,

Re: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
row group wouldn't be read if the predicate isn't satisfied due to index. 2. It is absolutely true the performance gain depends on the id distribution... On Mon, Apr 17, 2017 at 4:23 PM, 莫涛 <mo...@sensetime.com> wrote: > Hi Ryan, > > > The attachment is a screen shot for the sp

Re: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
<mo...@sensetime.com> wrote: > Hi Ryan, > > > 1. "expected qps and response time for the filter request" > > I expect that only the requested BINARY are scanned instead of all > records, so the response time would be "10K * 5MB / disk read speed", or >

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
I don't think you can insert into a hive table in parallel without dynamic partitioning; for hive locking please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work. On Mon, Apr 17, 2017 at 6:52 AM, Amol Patil wrote: > Hi All, > >

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
you can build a search tree using ids within each partition to act like an index, or create a bloom filter to see if current partition would have any hit. What's your expected qps and response time for the filter request? On Mon, Apr 17, 2017 at 2:23 PM, MoTao wrote: > Hi

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Ryan
you could write a udf using the asML method along with some type casting, then apply the udf to the data after PCA. When using a pipeline, that udf needs to be wrapped in a customized transformer, I think. On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath wrote: > Why not use
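A sketch of such a udf, assuming a hypothetical "features" column holding mllib vectors:

    import org.apache.spark.sql.functions.udf
    import org.apache.spark.mllib.linalg.{Vector => OldVector}

    // convert the old mllib vector type to the new ml type via asML
    val toML = udf((v: OldVector) => v.asML)
    val converted = df.withColumn("features_ml", toML(df("features")))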

Re: Why VectorUDT private?

2017-03-29 Thread Ryan
Spark version 2.1.0; the vector is from the ml package. The Vector in mllib has a public VectorUDT type. On Thu, Mar 30, 2017 at 10:57 AM, Ryan <ryan.hd@gmail.com> wrote: > I'm writing a transformer and the input column is vector type (which is the > output column from other

Why VectorUDT private?

2017-03-29 Thread Ryan
I'm writing a transformer and the input column is vector type(which is the output column from other transformer). But as the VectorUDT is private, how could I check/transform schema for the vector column?

Re: Groupby in fast in Impala than spark sql - any suggestions

2017-03-28 Thread Ryan
and could you paste the stage and task information from SparkUI On Wed, Mar 29, 2017 at 11:30 AM, Ryan <ryan.hd@gmail.com> wrote: > how long does it take if you remove the repartition and just collect the > result? I don't think repartition is needed here. There's alrea

Re: Groupby in fast in Impala than spark sql - any suggestions

2017-03-28 Thread Ryan
how long does it take if you remove the repartition and just collect the result? I don't think repartition is needed here. There's already a shuffle for group by On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am working on requirement where i

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Shixiong(Ryan) Zhu
mapPartitionsWithSplit was removed in Spark 2.0.0. You can use mapPartitionsWithIndex instead. On Tue, Mar 28, 2017 at 3:52 PM, Anahita Talebi wrote: > Thanks. > I tried this one, as well. Unfortunately I still get the same error. > > > On Wednesday, March 29, 2017,
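A sketch of the replacement call:

    // Spark 2.x: mapPartitionsWithIndex replaces the removed mapPartitionsWithSplit
    val withIdx = rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(x => (idx, x))  // tag each element with its partition index
    }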

Re: Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Ryan
No, you don't need one-hot encoding. But since the feature column is a vector and vectors only accept numbers, if your feature is a string then a StringIndexer is needed. Here's an example: http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier On Thu, Mar 23, 2017 at
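A condensed sketch along the lines of that example, with hypothetical column names:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel")
    val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
    val rf = new RandomForestClassifier().setLabelCol("indexedLabel").setFeaturesCol("features")

    val model = new Pipeline().setStages(Array(labelIndexer, assembler, rf)).fit(df)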

Re: Converting dataframe to dataset question

2017-03-23 Thread Ryan
you should import either spark.implicits or sqlContext.implicits, not both. Otherwise the compiler will be confused by two implicit transformations. The following code works for me, Spark version 2.1.0: object Test { def main(args: Array[String]) { val spark = SparkSession .builder

Re: Best way to deal with skewed partition sizes

2017-03-22 Thread Ryan
could you give the event timeline and dag for the time consuming stages on spark UI? On Thu, Mar 23, 2017 at 4:30 AM, Matt Deaver wrote: > For various reasons, our data set is partitioned in Spark by customer id > and saved to S3. When trying to read this data, however,

Re: Foreachpartition in spark streaming

2017-03-20 Thread Ryan
foreachPartition is an action but runs on each worker, which means you won't see anything on the driver. mapPartitions is a transformation which is lazy and won't do anything until an action. It depends on the specific use case which is better. To output something (like a print on a single machine) you could
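A minimal sketch of the difference:

    // action: runs on the workers, so the prints land in executor logs, not the driver
    rdd.foreachPartition(it => it.foreach(println))

    // transformation: lazy, nothing runs until an action such as collect()
    val mapped = rdd.mapPartitions(it => it.map(_.toString))
    mapped.collect().foreach(println)  // this print happens on the driver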

Re: Failed to connect to master ...

2017-03-07 Thread Shixiong(Ryan) Zhu
The Spark master may bind to a different address. Take a look at this page to find the correct URL: http://VM_IPAddress:8080/ On Tue, Mar 7, 2017 at 10:13 PM, Mina Aslani wrote: > Master and worker processes are running! > > On Wed, Mar 8, 2017 at 12:38 AM, ayan guha

Re: Structured Streaming - Kafka

2017-03-07 Thread Shixiong(Ryan) Zhu
Good catch. Could you create a ticket? You can also submit a PR to fix it if you have time :) On Tue, Mar 7, 2017 at 1:52 PM, Bowden, Chris wrote: > Potential bug when using startingOffsets = SpecificOffsets with Kafka > topics containing uppercase characters? > >

Re: Why spark history server does not show RDD even if it is persisted?

2017-02-28 Thread Shixiong(Ryan) Zhu
The REST APIs are not just for Spark history server. When an application is running, you can use the REST APIs to talk to Spark UI HTTP server as well. On Tue, Feb 28, 2017 at 10:46 AM, Parag Chaudhari wrote: > ping... > > > > *Thanks,Parag Chaudhari,**USC Alumnus (Fight

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
progress" > java.lang.OutOfMemoryError: Java heap space at > java.util.Arrays.copyOfRange(Arrays.java:3664) at > java.lang.String.(String.java:207) at > java.lang.StringBuilder.toString(StringBuilder.java:407) at > scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430) > at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55) > at java.util.TimerThread.mainLoop(Timer.java:555) at > java.util.TimerThread.run(Timer.java:505) > > -- Ryan Blue Software Engineer Netflix

Re: Fault tolerant broadcast in updateStateByKey

2017-02-07 Thread Shixiong(Ryan) Zhu
It's documented here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints On Tue, Feb 7, 2017 at 8:12 AM, Amit Sela wrote: > Hi all, > > I was wondering if anyone ever used a broadcast variable within > an

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread Shixiong(Ryan) Zhu
You can create lazily instantiated singleton instances. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints for examples of accumulators and broadcast variables. You can use the same approach to create your cached RDD. On Tue,
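A sketch of the lazily instantiated singleton pattern from that guide, with hypothetical contents:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object CachedResource {
      @volatile private var instance: Broadcast[Seq[String]] = null
      def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              instance = sc.broadcast(Seq("a", "b", "c"))  // hypothetical payload
            }
          }
        }
        instance
      }
    }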
