RE: How does predicate push down really help?

2016-11-16 Thread Mendelson, Assaf
In the first example, you define the table to be the users table from some SQL server. Then you perform a filter. Without predicate pushdown (or any optimization), Spark basically understands this as follows: “grab the data from the source described” (which in this case means get all of the table
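
A minimal PySpark sketch of the behaviour Assaf describes, assuming a hypothetical JDBC source (URL, table and credentials are placeholders): with pushdown the age predicate travels to the database, so only matching rows are transferred; without it, every row is fetched and filtered inside Spark.

    users = (spark.read.format("jdbc")
             .option("url", "jdbc:postgresql://dbhost:5432/mydb")  # placeholder connection
             .option("dbtable", "users")
             .option("user", "spark_user")
             .option("password", "***")
             .load())

    over_30 = users.filter("age > 30")
    # With a source that supports pushdown, the physical plan should list the
    # predicate under PushedFilters instead of a Filter over a full table scan.
    over_30.explain(True)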

Re: How does predicate push down really help?

2016-11-16 Thread Ashic Mahtab
Consider a data source that has data in 500 MB files and doesn't support predicate push down. Spark will have to load all the data into memory before it can apply filtering, select columns, etc. Each 500 MB file will at some point have to be loaded entirely into memory. Now consider a data source

Re: How does predicate push down really help?

2016-11-16 Thread kant kodali
Hi Assaf, I am still trying to understand the merits of predicate push down from the examples you pointed out. Example 1: Say we don't have a predicate push down feature, why does Spark need to pull all the rows and filter them in memory? Why not simply issue a select statement with a "where" clause

RE: Nested UDFs

2016-11-16 Thread Mendelson, Assaf
regexp_replace is supposed to receive a column, so you don’t need to write a UDF for it. Instead try: test_data.select(regexp_replace(test_data.name, 'a', 'X')). You would need a UDF if you wanted to do something on the string value of a single row (e.g. return data + “bla”). Assaf. From:
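
A runnable PySpark version of Assaf's suggestion, reusing the test_data frame from Perttu's original question:

    from pyspark.sql.functions import regexp_replace

    test_data = sqlContext.createDataFrame([('a',), ('b',), ('c',)], ('name',))
    # regexp_replace works directly on the column; no UDF wrapper is needed
    test_data.select(regexp_replace(test_data.name, 'a', 'X').alias('name')).show()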

RE: How does predicate push down really help?

2016-11-16 Thread Mendelson, Assaf
Actually, both of your examples translate to the same plan. When you do sql(“some code”) or filter, it doesn’t actually execute the query. Instead it is translated to a plan (parsed plan) which transforms everything into standard Spark expressions. Then Spark analyzes it to fill in the blanks (what is the users table
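
A quick way to check this for yourself (the users temp view below is a toy table registered only for the demonstration) is to compare the plans printed for the two forms:

    # register a toy users table so both expressions can be analyzed
    spark.range(100).selectExpr("id AS age").createOrReplaceTempView("users")

    spark.sql("select * from users where age > 30").explain(True)
    spark.table("users").filter("age > 30").explain(True)
    # the optimized logical and physical plans printed by the two calls match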

Nested UDFs

2016-11-16 Thread Perttu Ranta-aho
Hi, Shouldn't this work?

    from pyspark.sql.functions import regexp_replace, udf

    def my_f(data):
        return regexp_replace(data, 'a', 'X')

    my_udf = udf(my_f)

    test_data = sqlContext.createDataFrame([('a',), ('b',), ('c',)], ('name',))
    test_data.select(my_udf(test_data.name)).show()

But instead

How does predicate push down really help?

2016-11-16 Thread kant kodali
How does predicate push down really help in the following cases? val df1 = spark.sql("select * from users where age > 30") vs val df1 = spark.sql("select * from users") df1.filter("age > 30")

Re: Kafka segmentation

2016-11-16 Thread bo yang
I did not remember what exact configuration I was using. That link has some good information! Thanks Cody! On Wed, Nov 16, 2016 at 5:32 PM, Cody Koeninger wrote: > Yeah, if you're reporting issues, please be clear as to whether > backpressure is enabled, and whether

Re: Very long pause/hang at end of execution

2016-11-16 Thread Michael Johnson
On Wed, Nov 16, 2016 at 10:44 AM Aniket Bhatnagar wrote: Thanks for sharing the thread dump. I had a look at them and couldn't find anything unusual. Is there anything in the logs (driver + executor) that suggests what's going on? Also, what does the spark job do

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Nathan Lande
If you are dealing with a bunch of different schemas in one field, figuring out a strategy to deal with that will depend on your data and does not really have anything to do with Spark, since mapping your JSON payloads to tractable data structures will depend on business logic. The strategy of

Re: Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Divya Gehlot
Hi, You can use the Column functions provided by the Spark API: https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html Hope this helps. Thanks, Divya On 17 November 2016 at 12:08, 颜发才(Yan Facai) wrote: > Hi, > I have a sample, like: >

Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Yan Facai
Hi, I have a sample like:

    +---+------+--------------------+
    |age|gender|             city_id|
    +---+------+--------------------+
    |   |     1|1042015:city_2044...|
    |90s|     2|1042015:city_2035...|
    |80s|     2|1042015:city_2061...|
    +---+------+--------------------+

and the expectation is: "age": 90s

Configure spark.kryoserializer.buffer.max at runtime does not take effect

2016-11-16 Thread bluishpenguin
Hi all, I would like to configure the following setting at runtime as below: spark = (SparkSession .builder .appName("ElasticSearchIndex") .config("spark.kryoserializer.buffer.max", "1g") .getOrCreate()) But I still hit an error: Caused by: org.apache.spark.SparkException:
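
A hedged note rather than a confirmed diagnosis: spark.kryoserializer.buffer.max is read when the SparkContext is created, so if a context already exists (notebook, job server, spark-shell), builder.config(...) is silently ignored. A sketch of the two ways the setting usually takes effect:

    # 1) pass it at launch time:
    #      spark-submit --conf spark.kryoserializer.buffer.max=1g my_app.py
    # 2) or make sure no SparkContext exists before the session is built:
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ElasticSearchIndex")
             .config("spark.kryoserializer.buffer.max", "1g")  # honored only for a new context
             .getOrCreate())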

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Karim, Md. Rezaul
Hi Tariq and Jon, First of all, thanks for the quick response. I really appreciate that. Well, I would like to start from the very beginning of using Kafka with Spark. For example, in the Spark distribution, I found an example using Kafka with Spark Streaming that demonstrates a Direct Kafka Word Count
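
For reference, a condensed PySpark sketch in the spirit of the DirectKafkaWordCount example mentioned above (broker address and topic name are placeholders, and it assumes the spark-streaming-kafka-0-8 package is on the classpath):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="DirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)  # 2-second batches

    stream = KafkaUtils.createDirectStream(
        ssc, ["test-topic"], {"metadata.broker.list": "localhost:9092"})

    counts = (stream.map(lambda kv: kv[1])                 # take the message value
                    .flatMap(lambda line: line.split(" "))
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()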

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Jon Gregg
Since you're completely new to Kafka, I would start with the Kafka docs ( https://kafka.apache.org/documentation). You should be able to get through the Getting Started part easily and there are some examples for setting up a basic Kafka server. You don't need Kafka to start working with Spark

Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-16 Thread Yanbo Liang
Hi Pietro, Actually we have implemented the R survreg() counterpart in Spark: the accelerated failure time (AFT) model. You can refer to AFTSurvivalRegression if you use Scala/Java/Python. For SparkR users, you can try spark.survreg(). The algorithm is completely distributed and returns the same solution with
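
A small PySpark sketch of the AFTSurvivalRegression counterpart Yanbo points to (the toy rows are adapted from the standard example, not from Pietro's data; SparkR users would call spark.survreg() instead):

    from pyspark.ml.regression import AFTSurvivalRegression
    from pyspark.ml.linalg import Vectors

    training = spark.createDataFrame([
        (1.218, 1.0, Vectors.dense(1.560, -0.605)),
        (2.949, 0.0, Vectors.dense(0.346, 2.158)),
        (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    ], ["label", "censor", "features"])

    aft = AFTSurvivalRegression(censorCol="censor")  # censor = 1.0 means the event was observed
    model = aft.fit(training)
    model.transform(training).show()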

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Yeah, if you're reporting issues, please be clear as to whether backpressure is enabled, and whether maxRatePerPartition is set. I expect that there is something wrong with backpressure, see e.g. https://issues.apache.org/jira/browse/SPARK-18371 On Wed, Nov 16, 2016 at 5:05 PM, bo yang
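
For readers following the thread, a minimal sketch of the two settings Cody refers to (the rate value is purely illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.streaming.backpressure.enabled", "true")
            .set("spark.streaming.kafka.maxRatePerPartition", "1000"))  # records/sec per partition
    sc = SparkContext(conf=conf)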

Re: Kafka segmentation

2016-11-16 Thread bo yang
I hit a similar issue with Spark Streaming. The batch size seemed a little random. Sometimes it was large, with many Kafka messages inside the same batch; sometimes it was very small, with just a few messages. Is it possible that was caused by the backpressure implementation in Spark Streaming? On Wed,

Re: Kafka segmentation

2016-11-16 Thread Cody Koeninger
Moved to user list. I'm not really clear on what you're trying to accomplish (why put the csv file through Kafka instead of reading it directly with spark?) auto.offset.reset=largest just means that when starting the job without any defined offsets, it will start at the highest (most recent)

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread shyla deshpande
Ryan, I just wanted to provide more info. Here is my .proto file which is the basis for generating the Person class. Thanks. option java_package = "com.example.protos"; enum Gender { MALE = 1; FEMALE = 2; } message Address { optional string street = 1; optional string city = 2; }

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread shyla deshpande
Thanks for the response. Following is the Person class... // Generated by the Scala Plugin for the Protocol Buffer Compiler. // Do not edit! // // Protofile syntax: PROTO2 package com.example.protos.demo @SerialVersionUID(0L) final case class Person( name: scala.Option[String] = None,

SparkILoop doesn't run

2016-11-16 Thread Mohit Jaggi
I am trying to use SparkILoop to write some tests (shown below) but the test hangs with the following stack trace. Any idea what is going on? import org.apache.log4j.{Level, LogManager} import org.apache.spark.repl.SparkILoop import org.scalatest.{BeforeAndAfterAll, FunSuite} class

Any with S3 experience with Spark? Having ListBucket issues

2016-11-16 Thread Edden Burrow
Anyone dealing with a lot of files with Spark? We're trying s3a with 2.0.1 because we're seeing intermittent errors in S3 where jobs fail and saveAsTextFile fails. Using pyspark. Is there any issue with working in an S3 folder that has too many files? How about having versioning enabled? Are

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread Shixiong(Ryan) Zhu
Could you provide the Person class? On Wed, Nov 16, 2016 at 1:19 PM, shyla deshpande wrote: > I am using 2.11.8. Thanks > > On Wed, Nov 16, 2016 at 1:15 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Which Scala version are you using? Is it Scala 2.10?

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread shyla deshpande
I am using 2.11.8. Thanks On Wed, Nov 16, 2016 at 1:15 PM, Shixiong(Ryan) Zhu wrote: > Which Scala version are you using? Is it Scala 2.10? Scala 2.10 has some > known race conditions in reflection and the Scala community doesn't have > plan to fix it

Re: Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread Shixiong(Ryan) Zhu
Which Scala version are you using? Is it Scala 2.10? Scala 2.10 has some known race conditions in reflection and the Scala community doesn't have a plan to fix it (http://docs.scala-lang.org/overviews/reflection/thread-safety.html). AFAIK, the only way to fix it is upgrading to Scala 2.11. On Wed,

How to propagate R_LIBS to sparkr executors

2016-11-16 Thread Rodrick Brown
I’m having an issue with an R module not getting picked up on the slave nodes in Mesos. I have the environment variable R_LIBS set, but for some reason it is only set in the driver context and not the executors. Is there a way to pass environment values down to the executor level

RE: submitting a spark job using yarn-client and getting NoClassDefFoundError: org/apache/spark/Logging

2016-11-16 Thread David Robison
I’ve gotten a little further along. It now submits the job via Yarn, but now the jobs exit immediately with the following error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/Logging at java.lang.ClassLoader.defineClass1(Native Method) at

Spark 2.0.2 with Kafka source, Error please help!

2016-11-16 Thread shyla deshpande
I am using protobuf to encode. This may not be related to the new release issue Exception in thread "main" scala.ScalaReflectionException: is not a term at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:199) at

Re: Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Mohammad Tariq
Hi Karim, Are you looking for something specific? Some information about your use case would be really helpful in order to answer your question. On Wednesday, November 16, 2016, Karim, Md. Rezaul < rezaul.ka...@insight-centre.org> wrote: > Hi All, > > I am completely new with Kafka. I was

[SQL/Catalyst] Janino Generated Code Debugging

2016-11-16 Thread Aleksander Eskilson
Hi there, I have some jobs generating Java code (via Janino) that I would like to inspect more directly during runtime. The Janino page seems to indicate an environmental variable can be set to support debugging the generated code, allowing one to step into it directly and inspect variables and

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Michael Armbrust
On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon wrote: > Maybe it sounds like you are looking for from_json/to_json functions after > en/decoding properly. > Which are new built-in functions that will be released with Spark 2.1.
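
A hedged sketch of the from_json function mentioned here (new in Spark 2.1; df stands for a frame holding the encoded column, and the schema below is invented for illustration):

    from pyspark.sql.functions import from_json, col
    from pyspark.sql.types import StructType, StructField, LongType, StringType

    schema = StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])

    parsed = df.select(from_json(col("json_encoded_blob_column"), schema).alias("payload"))
    parsed.select("payload.id", "payload.name").show()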

RE: Spark UI shows Jobs are processing, but the files are already written to S3

2016-11-16 Thread Shreya Agarwal
I think that is a bug. I have seen that a lot especially with long running jobs where Spark skips a lot of stages because it has pre-computed results. And some of these are never marked as completed, even though in reality they are. I figured this out because I was using the interactive shell

Need guidelines in Spark Streaming and Kafka integration

2016-11-16 Thread Karim, Md. Rezaul
Hi All, I am completely new to Kafka. I was wondering if somebody could provide me some guidelines on how to develop real-time streaming applications using the Spark Streaming API with Kafka. I am aware of the Spark Streaming and Kafka integration [1]. However, a real-life example would be better

Spark UI shows Jobs are processing, but the files are already written to S3

2016-11-16 Thread Kuchekar
Hi, I am running a Spark job which saves the computed data (massive data) to S3. On the Spark UI I see that some jobs are active, but there is no activity in the logs. Also, on S3 all the data has been written (verified each bucket --> it has a _SUCCESS file). Am I missing something? Thanks. Kuchekar,

RE: CSV to parquet preserving partitioning

2016-11-16 Thread neil90
All you need to do is load all the files into one dataframe at once. Then save the dataframe using partitionBy - df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path") Then if you look at the new folder, it should look like how you want it, i.e. -
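
A short sketch of the suggestion above (paths are placeholders and "directoryCol" stands for whatever column encodes the original directory; the Spark 2.x CSV reader is used for brevity, and the databricks csv package would play the same role on 1.6):

    df = spark.read.csv("hdfs://path/input/*/*.csv", header=True)  # load every file at once

    (df.write
       .format("parquet")
       .partitionBy("directoryCol")  # one sub-directory per distinct value
       .save("hdfs://path/output"))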

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Hyukjin Kwon
It seems a bit weird. Could we open an issue and talk in the repository link I sent? Let me try to reproduce your case with your data if possible. On 17 Nov 2016 2:26 a.m., "Arun Patel" wrote: > I tried below options. > > 1) Increase executor memory. Increased up to

Re: use case reading files split per id

2016-11-16 Thread ruben
Yes that binary files function looks interesting, thanks for the tip. Some followup questions: - I always wonder when people are talking about 'small' files and 'large' files. Is there any rule of thumb when these things apply? Are small files those which can fit completely in memory on the node

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-16 Thread Russell Jurney
Asher, can you cast like that? Does that casting work? That is my confusion: I don't know what a DataFrame Vector turns into in terms of an RDD type. I'll try this, thanks. On Tue, Nov 15, 2016 at 11:25 AM, Asher Krim wrote: > What language are you using? For Java, you might

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
I tried the options below. 1) Increase executor memory. Increased up to the maximum possible, 14 GB. Same error. 2) Tried the new version - spark-xml_2.10:0.4.1. Same error. 3) Tried with lower-level rowTags. It worked for a lower-level rowTag and returned 16000 rows. Are there any workarounds for this

RE: CSV to parquet preserving partitioning

2016-11-16 Thread benoitdr
Yes, by parsing the file content, it's possible to recover which directory they are in. From: neil90 [via Apache Spark User List] [mailto:ml-node+s1001560n28083...@n3.nabble.com] Sent: Wednesday, 16 November 2016 17:41 To: Drooghaag, Benoit (Nokia - BE) Subject: Re:

Re: CSV to parquet preserving partitioning

2016-11-16 Thread neil90
Is there anything in the files to let you know which directory they should be in?

RE: AVRO File size when caching in-memory

2016-11-16 Thread Shreya Agarwal
Ah, yes. Nested schemas should be avoided if you want the best memory usage. From: Prithish Sent: Wednesday, November 16, 2016 12:48 AM To: Takeshi Yamamuro Cc: Shreya Agarwal;

HttpFileServer behavior in 1.6.3

2016-11-16 Thread Kai Wang
Hi I am running Spark 1.6.3 along with Spark Jobserver. I notice some interesting behaviors of HttpFileServer. When I destroy a SparkContext, HttpFileServer doesn't release the port. If I don't specify spark.fileserver.port, the next HttpFileServer binds to a new random port (as expected).

Re: Log-loss for multiclass classification

2016-11-16 Thread janardhan shetty
I am sure some work might be in the pipeline, as it is a standard evaluation criterion. Any thoughts or links? On Nov 15, 2016 11:15 AM, "janardhan shetty" wrote: > Hi, > > Best practice for multi class classification technique is to evaluate the > model by *log-loss*. > Is

RE: Problem submitting a spark job using yarn-client as master

2016-11-16 Thread David Robison
Unfortunately, my code doesn’t get far enough to have a SparkContext from which to set the Hadoop config parameters. Here is my Java code: SparkConf sparkConf = new SparkConf() .setJars(new String[] { "file:///opt/wildfly/mapreduce/mysparkjob-5.0.0.jar", })

RE: CSV to parquet preserving partitioning

2016-11-16 Thread Drooghaag, Benoit (Nokia - BE)
Good point, thanks! That does the job as long as the datasets corresponding to each input directory contain a single partition. The question now is how to achieve this without shuffling the data? I’m using the Databricks CSV reader on Spark 1.6 and I don’t think there is a way to control

problem deploying spark-jobserver on CentOS

2016-11-16 Thread Reza zade
Hi, I'm going to deploy jobserver on my CentOS box (Spark is installed with CDH 5.7). I'm using Oracle JDK 1.8, sbt 0.13.13, Spark 1.6.0 and jobserver 0.6.2. When I run the sbt command (after running sbt publish-local) I encounter the message below: [cloudera@quickstart spark-jobserver]$ sbt [info]

Re: Writing parquet table using spark

2016-11-16 Thread Dirceu Semighini Filho
Hello, Have you configured this property? spark.sql.parquet.compression.codec 2016-11-16 6:40 GMT-02:00 Vaibhav Sinha : > Hi, > I am using hiveContext.sql() method to select data from source table and > insert into parquet tables. > The query executed from spark takes
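
A one-line sketch of checking and setting the property Dirceu asks about (snappy is just an example value; gzip and uncompressed are also accepted):

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    sqlContext.getConf("spark.sql.parquet.compression.codec")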

Map and MapParitions with partition-local variable

2016-11-16 Thread Zsolt Tóth
Hi, I need to run a map() and a mapPartitions() on my input DF. As a side effect of the map(), a partition-local variable should be updated, which is then used in the mapPartitions() afterwards. I can't use a Broadcast variable, because it's shared between partitions on the same executor. Where can I

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Hyukjin Kwon
Maybe it sounds like you are looking for from_json/to_json functions after en/decoding properly. On 16 Nov 2016 6:45 p.m., "kant kodali" wrote: > > > https://spark.apache.org/docs/2.0.2/sql-programming-guide. > html#json-datasets > > "Spark SQL can automatically infer the

Re: Very long pause/hang at end of execution

2016-11-16 Thread Aniket Bhatnagar
Also, how are you launching the application? Through spark submit or creating spark content in your app? Thanks, Aniket On Wed, Nov 16, 2016 at 10:44 AM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Thanks for sharing the thread dump. I had a look at them and couldn't find > anything

Re: Very long pause/hang at end of execution

2016-11-16 Thread Aniket Bhatnagar
Thanks for sharing the thread dump. I had a look at them and couldn't find anything unusual. Is there anything in the logs (driver + executor) that suggests what's going on? Also, what does the spark job do and what is the version of spark and hadoop you are using? Thanks, Aniket On Wed, Nov 16,

How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread kant kodali
https://spark.apache.org/docs/2.0.2/sql-programming-guide.html#json-datasets "Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file." val df =
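
A tiny sketch of the documented behaviour quoted above, showing read.json applied to an RDD of JSON strings rather than to a file path:

    json_rdd = sc.parallelize(['{"name": "a", "age": 31}',
                               '{"name": "b", "age": 27}'])
    df = sqlContext.read.json(json_rdd)  # schema is inferred from the strings
    df.printSchema()
    df.show()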

Re: Very long pause/hang at end of execution

2016-11-16 Thread Pietro Pugni
I have the same issue with Spark 2.0.1, Java 1.8.x and pyspark. I also use SparkSQL and JDBC. My application runs locally. It happens only if I connect to the UI during Spark execution, and even if I close the browser before the execution ends. I observed this behaviour both on macOS Sierra and Red

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
It's something like the schema shown below (with several additional levels/sublevels):

    root
     |-- sentAt: long (nullable = true)
     |-- sharing: string (nullable = true)
     |-- receivedAt: long (nullable = true)
     |-- ip: string (nullable = true)
     |-- story: struct (nullable = true)
     |    |-- super:

Writing parquet table using spark

2016-11-16 Thread Vaibhav Sinha
Hi, I am using hiveContext.sql() method to select data from source table and insert into parquet tables. The query executed from spark takes about 3x more disk space to write the same number of rows compared to when fired from impala. Just wondering if this is normal behaviour and if there's a way

Re: AVRO File size when caching in-memory

2016-11-16 Thread Takeshi Yamamuro
Hi, What's the schema as interpreted by Spark? The compression logic of Spark caching depends on column types. // maropu On Wed, Nov 16, 2016 at 5:26 PM, Prithish wrote: > Thanks for your response. > > I did some more tests and I am seeing that when I have a flatter

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
Thanks for your response. I did some more tests and I am seeing that when I have a flatter structure for my AVRO, the cache memory use is close to the CSV. But when I use a few levels of nesting, the cache memory usage blows up. This is really critical for planning the cluster we will be using. To

Re: what is the optimized way to combine multiple dataframes into one dataframe ?

2016-11-16 Thread Deepak Sharma
Can you try caching the individual dataframes and then union them? It may save you time. Thanks Deepak On Wed, Nov 16, 2016 at 12:35 PM, Devi P.V wrote: > Hi all, > > I have 4 data frames with three columns, > > client_id,product_id,interest > > I want to combine these 4
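
A brief sketch of Deepak's suggestion (df1 ... df4 stand for the four frames with columns client_id, product_id and interest):

    from functools import reduce

    frames = [df1, df2, df3, df4]
    for f in frames:
        f.cache()  # materialize each frame once before combining

    combined = reduce(lambda a, b: a.unionAll(b), frames)  # union() in Spark 2.x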