Spark-Kafka integration - build failing with sbt

2017-06-16 Thread karan alang
I'm trying to compile Kafka & Spark Streaming integration code, i.e. reading
from Kafka using Spark Streaming, and the sbt build is failing with this
error -

  [error] (*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found

  Scala version -> 2.10.7
  Spark Version -> 2.1.0
  Kafka version -> 0.9
  sbt version -> 0.13

The contents of the sbt files are shown below ->

1)
  vi spark_kafka_code/project/plugins.sbt

  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

 2)
  vi spark_kafka_code/sparkkafka.sbt

import AssemblyKeys._
assemblySettings

name := "SparkKafka Project"

version := "1.0"
scalaVersion := "2.11.7"

val sparkVers = "2.1.0"

// Base Spark-provided dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVers % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVers % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVers)

mergeStrategy in assembly := {
  case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
  case m if m.toLowerCase.startsWith("META-INF")  => MergeStrategy.discard
  case "reference.conf"   => MergeStrategy.concat
  case m if m.endsWith("UnusedStubClass.class")   => MergeStrategy.discard
  case _ => MergeStrategy.first
}
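
A likely cause of the unresolved dependency above: for Spark 2.x there is no
artifact named spark-streaming-kafka; the Kafka integration is published as
spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10 instead. A minimal
sketch of the dependency block, assuming the 0-8 integration (which also works
with 0.9 brokers) and the Scala 2.11 build declared above:

// Sketch only: %% appends the Scala binary suffix (_2.11), so the resolved
// artifact is spark-streaming-kafka-0-8_2.11:2.1.0.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % sparkVers % "provided",
  "org.apache.spark" %% "spark-streaming"           % sparkVers % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVers)

Note that the scalaVersion set in the build (2.11.7) also has to match the
Scala version actually installed (the message above lists 2.10.7), otherwise
the _2.11 suffix added by %% will not line up with the compiler in use.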

  I launch sbt and then try to create an Eclipse project; the complete error
is shown below -

  -

  sbt
[info] Loading global plugins from /Users/karanalang/.sbt/0.13/plugins
[info] Loading project definition from
/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/project
[info] Set current project to SparkKafka Project (in build
file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/)
> eclipse
[info] About to create Eclipse project files for your project(s).
[info] Updating
{file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/}spark_kafka_code...
[info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
[warn] module not found:
org.apache.spark#spark-streaming-kafka_2.11;2.1.0
[warn]  local: tried
[warn]
/Users/karanalang/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[warn]  activator-launcher-local: tried
[warn]
/Users/karanalang/.activator/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[warn]  activator-local: tried
[warn]
/Users/karanalang/Documents/Technology/SCALA/activator-dist-1.3.10/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[warn]  public: tried
[warn]
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
[warn]  typesafe-releases: tried
[warn]
http://repo.typesafe.com/typesafe/releases/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
[warn]  typesafe-ivy-releasez: tried
[warn]
http://repo.typesafe.com/typesafe/ivy-releases/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[info] Resolving jline#jline;2.12.1 ...
[warn] ::
[warn] ::  UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
[warn] ::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.spark:spark-streaming-kafka_2.11:2.1.0
(/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/sparkkafka.sbt#L12-16)
[warn]   +- sparkkafka-project:sparkkafka-project_2.11:1.0
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
[info] Updating
{file:/Users/karanalang/Documents/Technology/Coursera_spark_scala/spark_kafka_code/}spark_kafka_code...
[info] Resolving org.apache.spark#spark-streaming-kafka_2.11;2.1.0 ...
[warn] module not found:
org.apache.spark#spark-streaming-kafka_2.11;2.1.0
[warn]  local: tried
[warn]
/Users/karanalang/.ivy2/local/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[warn]  activator-launcher-local: tried
[warn]
/Users/karanalang/.activator/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[warn]  activator-local: tried
[warn]
/Users/karanalang/Documents/Technology/SCALA/activator-dist-1.3.10/repository/org.apache.spark/spark-streaming-kafka_2.11/2.1.0/ivys/ivy.xml
[warn]  public: tried
[warn]
https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
[warn]  typesafe-releases: tried
[warn]
http://repo.typesafe.com/typesafe/releases/org/apache/spark/spark-streaming-kafka_2.11/2.1.0/spark-streaming-kafka_2.11-2.1.0.pom
[warn]  

Error while doing mvn release for spark 2.0.2 using scala 2.10

2017-06-16 Thread Kanagha Kumar
Hey all,


I'm trying to use Spark 2.0.2 with scala 2.10 by following this
https://spark.apache.org/docs/2.0.2/building-spark.html#building-for-scala-210

./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package


I could build the distribution successfully using
bash -xv dev/make-distribution.sh --tgz  -Dscala-2.10 -DskipTests

But when I try to do a maven release with the command below, it keeps failing
with this error:


Executing Maven:  -B -f pom.xml  -DscmCommentPrefix=[maven-release-plugin]
-e  -Dscala-2.10 -Pyarn -Phadoop-2.7 -Phadoop-provided -DskipTests
-Dresume=false -U -X release:prepare release:perform

Failed to execute goal on project spark-sketch_2.10: Could not resolve
dependencies for project
org.apache.spark:spark-sketch_2.10:jar:2.0.2-sfdc-3.0.0: Failure to find
org.apache.spark:spark-tags_2.11:jar:2.0.2-sfdc-3.0.0 in  was cached in the
local repository, resolution will not be reattempted until the update interval
of nexus has elapsed or updates are forced - [Help 1]


Why does spark-sketch depend upon spark-tags_2.11 when I have already
compiled against scala 2.10?? Any pointers would be helpful.
Thanks
Kanagha


Re: Spark SQL within a DStream map function

2017-06-16 Thread Burak Yavuz
Do you really need to create a DStream from the original messaging queue?
Can't you just read them in a while loop or something on the driver?

On Fri, Jun 16, 2017 at 1:01 PM, Mike Hugo  wrote:

> Hello,
>
> I have a web application that publishes JSON messages on to a messaging
> queue that contain metadata and a link to a CSV document on S3.  I'd like
> to iterate over these JSON messages, and for each one pull the CSV document
> into spark SQL to transform it (based on the metadata in the JSON message)
> and output the results to a search index.  Each file on S3 has different
> headers, potentially different delimiters, and differing numbers of rows.
>
> Basically what I'm trying to do is something like this:
>
> JavaDStream<ParsedDocument> parsedMetadataAndRows =
> queueStream.map(new Function<String, ParsedDocument>() {
> @Override
> ParsedDocument call(String metadata) throws Exception {
> Map gson = new Gson().fromJson(metadata, Map.class)
>
> // get metadata from gson
> String s3Url = gson.url
> String delimiter = gson.delimiter
> // etc...
>
> // read s3Url
> Dataset<Row> dataFrame = sqlContext.read()
> .format("com.databricks.spark.csv")
> .option("delimiter", delimiter)
> .option("header", true)
> .option("inferSchema", true)
> .load(s3Url)
>
> // process document,
> ParsedDocument docPlusRows = //...
>
> return docPlusRows
> })
>
> JavaEsSparkStreaming.saveToEs(parsedMetadataAndRows,
> "index/type" // ...
>
>
> But it appears I cannot get access to the sqlContext when I run this on
> the spark cluster because that code is executing in the executor not in the
> driver.
>
> Is there a way I can access or create a SqlContext to be able to pull the
> file down from S3 in my map function?  Or do you have any recommendations
> as to how I could set up a streaming job in a different way that would
> allow me to accept metadata on the stream of records coming in and pull
> each file down from s3 for processing?
>
> Thanks in advance for your help!
>
> Mike
>
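
One way to keep the Spark SQL work on the driver, in the spirit of the
suggestion above, is to do it inside foreachRDD (a rough Scala sketch; the
CsvMeta class and parseMetadata function are hypothetical placeholders for the
poster's own JSON handling, and the spark-csv options are copied from the
original snippet):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical holder for the fields pulled out of the JSON metadata.
case class CsvMeta(s3Url: String, delimiter: String)

def processMetadata(metadataStream: DStream[String],
                    sqlContext: SQLContext,
                    parseMetadata: String => CsvMeta): Unit = {
  metadataStream.foreachRDD { rdd =>
    // foreachRDD executes on the driver, so sqlContext is usable here.
    rdd.collect().foreach { json =>
      val meta = parseMetadata(json)
      val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("delimiter", meta.delimiter)
        .option("header", "true")
        .option("inferSchema", "true")
        .load(meta.s3Url)
      df.show() // placeholder: transform df and write it to the search index here
    }
  }
}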


Spark SQL within a DStream map function

2017-06-16 Thread Mike Hugo
Hello,

I have a web application that publishes JSON messages on to a messaging
queue that contain metadata and a link to a CSV document on S3.  I'd like
to iterate over these JSON messages, and for each one pull the CSV document
into spark SQL to transform it (based on the metadata in the JSON message)
and output the results to a search index.  Each file on S3 has different
headers, potentially different delimiters, and differing numbers of rows.

Basically what I'm trying to do is something like this:

JavaDStream<ParsedDocument> parsedMetadataAndRows =
queueStream.map(new Function<String, ParsedDocument>() {
@Override
ParsedDocument call(String metadata) throws Exception {
Map gson = new Gson().fromJson(metadata, Map.class)

// get metadata from gson
String s3Url = gson.url
String delimiter = gson.delimiter
// etc...

// read s3Url
Dataset<Row> dataFrame = sqlContext.read()
.format("com.databricks.spark.csv")
.option("delimiter", delimiter)
.option("header", true)
.option("inferSchema", true)
.load(s3Url)

// process document,
ParsedDocument docPlusRows = //...

return docPlusRows
})

JavaEsSparkStreaming.saveToEs(parsedMetadataAndRows,
"index/type" // ...


But it appears I cannot get access to the sqlContext when I run this on the
spark cluster because that code is executing in the executor not in the
driver.

Is there a way I can access or create a SqlContext to be able to pull the
file down from S3 in my map function?  Or do you have any recommendations
as to how I could set up a streaming job in a different way that would
allow me to accept metadata on the stream of records coming in and pull
each file down from s3 for processing?

Thanks in advance for your help!

Mike


Fwd: Repartition vs PartitionBy Help/Understanding needed

2017-06-16 Thread Aakash Basu
Hi all,

Can somebody shed some light on this, please?

Thanks,
Aakash.
-- Forwarded message --
From: "Aakash Basu" 
Date: 15-Jun-2017 2:57 PM
Subject: Repartition vs PartitionBy Help/Understanding needed
To: "user" 
Cc:

Hi all,
>
> Everybody explains the difference between coalesce and repartition, but
> nowhere have I found the difference between partitionBy and repartition. My
> question is: is it better to write a data set to parquet partitioned by a
> column and then read the relevant directories to work on that column, or to
> use repartition on that column to do the same work in memory?
>
>
> A) One scenario is -
>
> val partitioned_DF = df_factFundRate.repartition($"YrEqual") // New change
> for performance test
>
>
> val df_YrEq_true =
> partitioned_DF.filter("YrEqual=true").withColumnRenamed("validFromYr",
> "yr_id").drop("validThruYr")
>
> val exists = partitioned_DF.filter("YrEqual = false").count()
> if(exists > 0)
>
>
>
> B) And the other scenario is -
>
> val df_cluster = sqlContext.sql("select * from factFundRate cluster by
> YrEqual")
> df_factFundRate.coalesce(50).write.mode("overwrite").option("header",
> "true").partitionBy("YrEqual").parquet(args(25))
>
> val df_YrEq_true =
> sqlContext.read.parquet(args(25)+"YrEqual=true/").withColumnRenamed("validFromYr",
> "yr_id").drop("validThruYr")
>
>
> val hadoopconf = new Configuration()
> val fileSystem = FileSystem.get(hadoopconf)
>
>
> val exists = FileSystem.get(new URI(args(26)),
> sparkContext.hadoopConfiguration).exists(new
> Path(args(25)+"YrEqual=false"))
> if(exists)
>
>
> The second scenario finishes within 6 mins whereas the first scenario
> takes 13 mins to complete.
>
> Please help!
>
>
> Thanks in adv,
> Aakash.
>
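
For reference, the two scenarios do different things, roughly as in this sketch
(column name and identifiers are taken from the question; the temp path is made
up): repartition($"YrEqual") hash-shuffles the rows by that column in memory
for the current job only, while write.partitionBy("YrEqual") lays the data out
on disk as one directory per value, so a later read of a single directory scans
only that subset.

// A) in-memory: shuffle the DataFrame by the column for this job only
val partitioned_DF = df_factFundRate.repartition($"YrEqual")

// B) on-disk: one directory per YrEqual value; reading ".../YrEqual=true/"
//    afterwards touches only that partition's files
df_factFundRate.write.mode("overwrite").partitionBy("YrEqual").parquet("/tmp/factFundRate")
val df_YrEq_true = sqlContext.read.parquet("/tmp/factFundRate/YrEqual=true/")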


Re: Max number of columns

2017-06-16 Thread Jan Holmberg
Oops, wrong user list. Sorry. :-)

> On 16 Jun 2017, at 10.44, Jan Holmberg  wrote:
> 
> Hi,
> I ran into Kudu limitation of max columns (300). Same limit seemed to apply 
> latest Kudu version as well but not ex. Impala/Hive (in the same extent at 
> least).
> * is this limitation going to be loosened in near future?
> * any suggestions how to get over this limitation? Table splitting is the 
> obvious one but in my case not the desired solution.
> 
> cheers,
> -jan





Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-16 Thread Georg Heiler
I assume you want this lifecycle in order to create big/heavy/complex objects
only once (per partition); mapPartitions should fit this use case pretty well.
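
A minimal sketch of the mapPartitions pattern (ExpensiveResource is an
illustrative stand-in for whatever heavy object the UDF's initialize()/
configure() would have built): the object is constructed once per partition and
reused for every element in it.

import org.apache.spark.sql.SparkSession

object MapPartitionsInitSketch {
  // Placeholder for a heavy object (parser, model, connection pool, ...).
  class ExpensiveResource {
    def score(s: String): Int = s.length
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sketch").master("local[*]").getOrCreate()
    val rdd   = spark.sparkContext.parallelize(Seq("a", "bb", "ccc"))

    val scored = rdd.mapPartitions { iter =>
      val resource = new ExpensiveResource() // built once per partition, not per row
      iter.map(resource.score)               // reused for every element
    }
    scored.collect().foreach(println)
    spark.stop()
  }
}
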
RD  wrote on Fri, 16 Jun 2017 at 17:37:

> Thanks Georg. But I'm not sure how mapPartitions is relevant here.  Can
> you elaborate?
>
>
>
> On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler 
> wrote:
>
>> What about using map partitions instead?
>>
>> RD  wrote on Thu, 15 Jun 2017 at 06:52:
>>
>>> Hi Spark folks,
>>>
>>> Is there any plan to support the richer UDF API that Hive supports
>>> for Spark UDFs ? Hive supports the GenericUDF API which has, among others
>>> methods like initialize(), configure() (called once on the cluster) etc,
>>> which a lot of our users use. We have now a lot of UDFs in Hive which make
>>> use of these methods. We plan to move our UDFs to Spark UDFs but are being
>>> limited by not having similar lifecycle methods.
>>>Are there plans to address these? Or do people usually adopt some
>>> sort of workaround?
>>>
>>>If we  directly use  the Hive UDFs  in Spark we pay a performance
>>> penalty. I think Spark anyways does a conversion from InternalRow to Row
>>> back to InternalRow for native spark udfs and for Hive it does InternalRow
>>> to Hive Object back to InternalRow but somehow the conversion in native
>>> udfs is more performant.
>>>
>>> -Best,
>>> R.
>>>
>>
>


Re: What is the charting library used by Databricks UI?

2017-06-16 Thread kant kodali
I have a Chrome plugin that detects all JS libraries! So, to answer my own
question, it's D3.js.

On Fri, Jun 16, 2017 at 6:12 AM, Mahesh Sawaiker <
mahesh_sawai...@persistent.com> wrote:

> Is there a live url on internet, where I can see the UI? I could help by
> checking the js code in firebug.
>
>
>
> *From:* kant kodali [mailto:kanth...@gmail.com]
> *Sent:* Friday, June 16, 2017 1:26 PM
> *To:* user @spark
> *Subject:* What is the charting library used by Databricks UI?
>
>
>
> Hi All,
>
>
>
> I am wondering what is the charting library used by Databricks UI to
> display graphs in real time while streaming jobs?
>
>
>
> Thanks!
>
>


RE: spark-submit: file not found exception occurs

2017-06-16 Thread LisTree Team
You may need to use an HDFS path rather than a local file when running under YARN.


 Original Message 
Subject: spark-submit: file not found exception occurs
From: Shupeng Geng 
Date: Thu, June 15, 2017 8:14 pm
To: "user@spark.apache.org" ,
"d...@spark.apache.org" 

Hi, everyone,

An annoying problem occurs to me. When submitting a spark job, the jar file not found exception is thrown as follows:

Exception in thread "main" java.io.FileNotFoundException: File file:/home/algo/shupeng/eeop_bridger/EeopBridger-1.0-SNAPSHOT.jar does not exist
  at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
  at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
  at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:340)
  at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:433)
  at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:530)
  at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$10.apply(Client.scala:529)

I'm pretty sure the file is located just where it should be.

Here is the submit shell cmd:

/usr/local/hadoop/spark-release/bin/spark-submit --master yarn --deploy-mode cluster --queue root.algo --name eeop_bridiger --class Bridger EeopBridger-1.0-SNAPSHOT.jar

Somebody give me some help, please. Many thanks.






Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-16 Thread RD
Thanks Georg. But I'm not sure how mapPartitions is relevant here.  Can you
elaborate?



On Thu, Jun 15, 2017 at 4:18 AM, Georg Heiler 
wrote:

> What about using map partitions instead?
>
> RD  wrote on Thu, 15 Jun 2017 at 06:52:
>
>> Hi Spark folks,
>>
>> Is there any plan to support the richer UDF API that Hive supports
>> for Spark UDFs ? Hive supports the GenericUDF API which has, among others
>> methods like initialize(), configure() (called once on the cluster) etc,
>> which a lot of our users use. We have now a lot of UDFs in Hive which make
>> use of these methods. We plan to move our UDFs to Spark UDFs but are being
>> limited by not having similar lifecycle methods.
>>Are there plans to address these? Or do people usually adopt some sort
>> of workaround?
>>
>>If we  directly use  the Hive UDFs  in Spark we pay a performance
>> penalty. I think Spark anyways does a conversion from InternalRow to Row
>> back to InternalRow for native spark udfs and for Hive it does InternalRow
>> to Hive Object back to InternalRow but somehow the conversion in native
>> udfs is more performant.
>>
>> -Best,
>> R.
>>
>


Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi Saatvik

You can write your own transformer to make sure that the column contains only
the values you provided, and filter out the rows that don't.

Something like this


import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

case class CategoryTransformer(override val uid: String) extends Transformer {
  // Keep only the rows whose col1 value is in the allowed set.
  override def transform(inputData: DataFrame): DataFrame = {
    inputData.select("col1").filter("col1 in ('happy')")
  }
  override def copy(extra: ParamMap): Transformer = ???
  @DeveloperApi
  override def transformSchema(schema: StructType): StructType = {
    schema // this transformer does not change the schema
  }
}


Usage

val data = sc.parallelize(List("abce","happy")).toDF("col1")
val trans = new CategoryTransformer("1")
data.show()
trans.transform(data).show()


This transformer will make sure you always have the values in col1 that you
provided.


Regards
Pralabh Kumar
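
If the goal is simply to restrict a column to a fixed set of allowed values (as
in the EMOTION example quoted below), a plain isin filter is a lighter-weight
sketch of the same idea, written in the spark-shell style of the usage example
above (column name and values are taken from that example):

import org.apache.spark.sql.functions.col

// Sketch: keep only rows whose EMOTION value is in the allowed set.
val emotions = sc.parallelize(Seq("HAPPY", "angry", "SAD")).toDF("EMOTION")
val allowed  = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")
emotions.filter(col("EMOTION").isin(allowed: _*)).show()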

On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah 
wrote:

> Hi Pralabh,
>
> I want the ability to create a column such that its values be restricted
> to a specific set of predefined values.
> For example, suppose I have a column called EMOTION: I want to ensure each
> row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>
> Thanks and Regards,
> Saatvik Shah
>
>
> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar 
> wrote:
>
>> Hi satvik
>>
>> Can u please provide an example of what exactly you want.
>>
>>
>>
>> On 16-Jun-2017 7:40 PM, "Saatvik Shah"  wrote:
>>
>>> Hi Yan,
>>>
>>> Basically the reason I was looking for the categorical datatype is as
>>> given here
>>> :
>>> ability to fix column values to specific categories. Is it possible to
>>> create a user defined data type which could do so?
>>>
>>> Thanks and Regards,
>>> Saatvik Shah
>>>
>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) 
>>> wrote:
>>>
 You can use some Transformers to handle categorical data,
 For example,
 StringIndexer encodes a string column of labels to a column of label
 indices:
 http://spark.apache.org/docs/latest/ml-features.html#stringindexer


 On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
 saatvikshah1...@gmail.com> wrote:

> Hi,
> I'm trying to convert a Pandas -> Spark dataframe. One of the columns
> I have
> is of the Category type in Pandas. But there does not seem to be
> support for
> this same type in Spark. What is the best alternative?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
> Spark-Dataframe-tp28764.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

>>>
>>>
>>> --
>>> *Saatvik Shah,*
>>> *1st  Year,*
>>> *Masters in the School of Computer Science,*
>>> *Carnegie Mellon University*
>>>
>>> *https://saatvikshah1994.github.io/ *
>>>
>>
>
>
> --
> *Saatvik Shah,*
> *1st  Year,*
> *Masters in the School of Computer Science,*
> *Carnegie Mellon University*
>
> *https://saatvikshah1994.github.io/ *
>


Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Pralabh,

I want the ability to create a column such that its values be restricted to
a specific set of predefined values.
For example, suppose I have a column called EMOTION: I want to ensure each
row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.

Thanks and Regards,
Saatvik Shah

On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar 
wrote:

> Hi satvik
>
> Can u please provide an example of what exactly you want.
>
>
>
> On 16-Jun-2017 7:40 PM, "Saatvik Shah"  wrote:
>
>> Hi Yan,
>>
>> Basically the reason I was looking for the categorical datatype is as
>> given here
>> : ability
>> to fix column values to specific categories. Is it possible to create a
>> user defined data type which could do so?
>>
>> Thanks and Regards,
>> Saatvik Shah
>>
>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) 
>> wrote:
>>
>>> You can use some Transformers to handle categorical data,
>>> For example,
>>> StringIndexer encodes a string column of labels to a column of label
>>> indices:
>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>
>>>
>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>> saatvikshah1...@gmail.com> wrote:
>>>
 Hi,
 I'm trying to convert a Pandas -> Spark dataframe. One of the columns I
 have
 is of the Category type in Pandas. But there does not seem to be
 support for
 this same type in Spark. What is the best alternative?



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
 Spark-Dataframe-tp28764.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>>
>> --
>> *Saatvik Shah,*
>> *1st  Year,*
>> *Masters in the School of Computer Science,*
>> *Carnegie Mellon University*
>>
>> *https://saatvikshah1994.github.io/ *
>>
>


-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/ *


Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi satvik

Can u please provide an example of what exactly you want.



On 16-Jun-2017 7:40 PM, "Saatvik Shah"  wrote:

> Hi Yan,
>
> Basically the reason I was looking for the categorical datatype is as
> given here :
> ability to fix column values to specific categories. Is it possible to
> create a user defined data type which could do so?
>
> Thanks and Regards,
> Saatvik Shah
>
> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) 
> wrote:
>
>> You can use some Transformers to handle categorical data,
>> For example,
>> StringIndexer encodes a string column of labels to a column of label
>> indices:
>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>
>>
>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>> saatvikshah1...@gmail.com> wrote:
>>
>>> Hi,
>>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns I
>>> have
>>> is of the Category type in Pandas. But there does not seem to be support
>>> for
>>> this same type in Spark. What is the best alternative?
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>>> Spark-Dataframe-tp28764.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>
>
> --
> *Saatvik Shah,*
> *1st  Year,*
> *Masters in the School of Computer Science,*
> *Carnegie Mellon University*
>
> *https://saatvikshah1994.github.io/ *
>


Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Saatvik Shah
Hi Yan,

Basically the reason I was looking for the categorical datatype is as given
here :
ability to fix column values to specific categories. Is it possible to
create a user defined data type which could do so?

Thanks and Regards,
Saatvik Shah

On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai)  wrote:

> You can use some Transformers to handle categorical data,
> For example,
> StringIndexer encodes a string column of labels to a column of label
> indices:
> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>
>
> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
> saatvikshah1...@gmail.com> wrote:
>
>> Hi,
>> I'm trying to convert a Pandas -> Spark dataframe. One of the columns I
>> have
>> is of the Category type in Pandas. But there does not seem to be support
>> for
>> this same type in Spark. What is the best alternative?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-
>> Spark-Dataframe-tp28764.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/ *
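
A minimal sketch of the StringIndexer approach suggested above (assuming a
spark-shell style SparkSession named spark; column names are illustrative): it
maps each distinct label to a numeric index, which is how Spark ML represents
categorical columns rather than via a dedicated category dtype.

import org.apache.spark.ml.feature.StringIndexer

// Sketch: encode a string label column into numeric category indices.
val df = spark.createDataFrame(Seq(
  (0, "HAPPY"), (1, "SAD"), (2, "HAPPY"), (3, "ANGRY")
)).toDF("id", "EMOTION")

val indexer = new StringIndexer()
  .setInputCol("EMOTION")
  .setOutputCol("EMOTION_index")

indexer.fit(df).transform(df).show()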


RE: What is the charting library used by Databricks UI?

2017-06-16 Thread Mahesh Sawaiker
Is there a live URL on the internet where I can see the UI? I could help by
checking the JS code in Firebug.

From: kant kodali [mailto:kanth...@gmail.com]
Sent: Friday, June 16, 2017 1:26 PM
To: user @spark
Subject: What is the charting library used by Databricks UI?

Hi All,

I am wondering what is the charting library used by Databricks UI to display 
graphs in real time while streaming jobs?

Thanks!




Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-16 Thread Ryan
I don't think Broadcast itself can be serialized. You can get the value out
on the driver side and refer to it in foreach; the value will then be
serialized with the lambda expression and sent to the workers.

On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko <
kravchenko.anto...@gmail.com> wrote:

> How one would access a broadcasted variable from within
> ForeachPartitionFunction  Spark(2.0.1) Java API ?
>
> Integer _bcv = 123;
> Broadcast<Integer> bcv = spark.sparkContext().broadcast(_bcv);
> Dataset<Row> df_sql = spark.sql("select * from atable");
>
> df_sql.foreachPartition(new ForeachPartitionFunction<Row>() {
> public void call(Iterator<Row> t) throws Exception {
> System.out.println(bcv.value());
> }}
> );
>
>
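
A sketch of that suggestion in Scala (assuming a SparkSession named spark and
that the table "atable" exists): read the broadcast's value on the driver and
let the closure capture the plain value, which then ships with the lambda.

val bcv   = spark.sparkContext.broadcast(123)
val local = bcv.value                       // resolved on the driver
val df    = spark.sql("select * from atable")

df.foreachPartition { rows =>
  println(local)                            // `local` travels with the closure
  rows.foreach(_ => ())                     // per-row work would go here
}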


[Error] Python version mismatch in CDH cluster when running pyspark job

2017-06-16 Thread Divya Gehlot
Hi,
I have a CDH cluster and I am running a pyspark script in client mode.
There are different Python versions installed on the client and worker nodes,
and I was getting a Python version mismatch error.
To resolve this issue I followed the Cloudera document below:
https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_troubleshooting.html#workloads__job_fail_python

added below lines
export PYSPARK_PYTHON=/usr/bin/python/
export PYSPARK_DRIVER_PYTHON=python

I am still getting the version mismatch error.
Did anybody else encounter this issue?
Can you please share how you resolved it?
I would really appreciate the help.

PS - attaching the screen shot of the code added .

Thanks,
Divya


What is the charting library used by Databricks UI?

2017-06-16 Thread kant kodali
Hi All,

I am wondering what is the charting library used by Databricks UI to
display graphs in real time while streaming jobs?

Thanks!


Max number of columns

2017-06-16 Thread Jan Holmberg
Hi,
I ran into Kudu limitation of max columns (300). Same limit seemed to apply 
latest Kudu version as well but not ex. Impala/Hive (in the same extent at 
least).
* is this limitation going to be loosened in near future?
* any suggestions how to get over this limitation? Table splitting is the 
obvious one but in my case not the desired solution.

cheers,
-jan



RE: [SparkSQL] Escaping a query for a dataframe query

2017-06-16 Thread mark.jenki...@baesystems.com
Thanks both!

FYI the suggestion to escape the quote does not seem to work. I should have 
mentioned I am using spark 1.6.2 and have tried to escape the double quote with 
\\ and .

My gut feeling is that escape characters are not considered for UDF parameters in this 
version of Spark – I would like to be wrong.

[Thread-18] ERROR QueryService$ - Failed to complete query, will mark job as 
failed
java.lang.RuntimeException: [1.118] failure: ``)'' expected but `end' found

SELECT * FROM mytable WHERE mycolumn BETWEEN 1 AND 2 AND 
(myudfsearchfor(index2, "**", "*start\"end*"))


  ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
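
One possible workaround that sidesteps SQL-string escaping entirely (a sketch
only, not verified against 1.6.2) is to build the same filter with the
DataFrame/Column API, where the search string is an ordinary Scala literal and
myudfsearchfor is assumed to be already registered as a UDF:

import org.apache.spark.sql.functions.{callUDF, col, lit}

// Sketch: equivalent to the SQL above, but the embedded double quote only
// needs Scala-level escaping, not SQL-level escaping.
val result = sqlContext.table("mytable")
  .where(col("mycolumn").between(1, 2) &&
         callUDF("myudfsearchfor", col("index2"), lit("**"), lit("*start\"end*")))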

From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com]
Sent: 15 June 2017 19:35
To: Michael Mior
Cc: Jenkins, Mark (UK Guildford); user@spark.apache.org
Subject: Re: [SparkSQL] Escaping a query for a dataframe query


ATTENTION. This message originates from outside BAE Systems.
It might be something that I am saying wrong but sometimes it may just make 
sense to see the difference between ” and "


<”> 8221, Hex 201d, Octal 20035

<">  34,  Hex 22,  Octal 042



Regards,


Gourav

On Thu, Jun 15, 2017 at 6:45 PM, Michael Mior 
> wrote:
Assuming the parameter to your UDF should be start"end (with a quote in the 
middle) then you need to insert a backslash into the query (which must also be 
escaped in your code). So just add two extra backslashes before the quote 
inside the string.

sqlContext.sql("SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND 
(myudfsearchfor(\"start\\\"end\"))"

--
Michael Mior
mm...@apache.org

2017-06-15 12:05 GMT-04:00 
mark.jenki...@baesystems.com 
>:
Hi,

I have a query  sqlContext.sql(“SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 
AND 2) AND (myudfsearchfor(\“start\"end\”))”



How should I escape the double quote so that it successfully parses?



I know I can use single quotes but I do not want to since I may need to search 
for a single and double quote.



The exception I get is


[Thread-18] ERROR QueryService$ - Failed to complete query, will mark job as 
failed java.lang.RuntimeException: [1.117] failure: ``)'' expected but "end" 
found

SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND 
(myudfsearchfor(\“start\"end\”))

^
  at scala.sys.package$.error(package.scala:27)
  at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
  at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
  at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
  at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)

Thank you

