Given an Avro Schema object, is there a way to get StructType in Java?

2017-12-15 Thread kant kodali
Hi All,

Given an Avro Schema object, is there a way to get the StructType that
represents the schema in Java?

Thanks!
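
A hedged pointer rather than a definitive answer: the spark-avro package
(com.databricks:spark-avro at the time) ships a SchemaConverters helper that maps an
Avro Schema to a Catalyst DataType, though whether it is exposed as public API depends
on the version. A minimal Scala sketch, reachable in much the same way from Java:

import com.databricks.spark.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// avroSchema is an org.apache.avro.Schema for a record type; for a record schema
// the converted Catalyst type is a StructType.
val structType = SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]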


NASA CDF files in Spark

2017-12-15 Thread Christopher Piggott
I'm looking to run a job that involves a zillion files in a format called
CDF, a NASA standard.  There are a number of libraries out there that can
read CDFs, but most of them are not high quality compared to the official
NASA one, which has Java bindings (via JNI).  It's a little clumsy, but I
have it working fairly well in Scala.

The way I was planning on distributing work was with
SparkContext.binaryFiles("hdfs://somepath/*"), but that's really sending in
an RDD of byte[], and unfortunately the CDF library doesn't support any kind
of array or stream as input.  The reason is that CDF is really looking for
a random-access file, for performance reasons.

What's worse, all this code is implemented down at the native layer, in C.

I think my best choice here is to distribute the job using .binaryFiles()
but then have the first task of the worker be to write all those bytes to a
ramdisk file (or maybe a real file, we'll see)... then have the CDF library
open it as if it were a local file.  This seems clumsy and awful but I
haven't come up with any other good ideas.
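
For what it's worth, a rough Scala sketch of that spill-to-local-file approach. The CDF
open call below is a placeholder (check the exact entry point against the NASA library's
docs), and extractWhatYouNeed is a hypothetical stand-in for the per-file work:

import java.nio.file.Files

sc.binaryFiles("hdfs://somepath/*")
  .map { case (name, pds) =>
    // Spill the bytes to a local file so the native, random-access CDF reader can
    // open it; the path could just as well sit on a ramdisk such as /dev/shm.
    val tmp = Files.createTempFile("cdf-", ".cdf")
    try {
      Files.write(tmp, pds.toArray())
      val cdf = gsfc.nssdc.cdf.CDF.open(tmp.toString) // placeholder: the library's open-by-path call
      try extractWhatYouNeed(name, cdf)               // hypothetical per-file extraction
      finally cdf.close()
    } finally Files.delete(tmp)
  }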

Has anybody else worked with these files and have a better idea?  Some info
on the library that parses all this:

https://cdf.gsfc.nasa.gov/html/cdf_docs.html


--Chris


Please Help with DecisionTree/FeatureIndexer

2017-12-15 Thread Marco Mistroni
Hi all,
I am trying to run a sample decision tree, following the examples here (for
MLlib):

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

The example seems to use a VectorIndexer, however I am missing something.
How does the featureIndexer know which columns are features?
Isn't there something missing? Or is the featureIndexer able to figure out
by itself which columns of the DataFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous
  .fit(data)

Using this code I am getting back this exception:

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
    at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
    at scala.collection.AbstractMap.getOrElse(Map.scala:59)
    at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
    at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

What am I missing?

w/kindest regards

Marco
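
One thing worth checking, offered as a hedged note: the linked example loads
sample_libsvm_data.txt via the libsvm data source, which already yields a vector column
literally named "features", so VectorIndexer finds its input column. With an ordinary
DataFrame that column usually has to be assembled first, e.g. with a VectorAssembler.
A sketch assuming hypothetical numeric columns f1 and f2:

import org.apache.spark.ml.feature.VectorAssembler

// Build the "features" vector column that VectorIndexer expects as its input.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))   // hypothetical feature column names
  .setOutputCol("features")

val dataWithFeatures = assembler.transform(data)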


Re: Several Aggregations on a window function

2017-12-15 Thread Julien CHAMP
Maybe I should consider something like Impala?

On Fri, Dec 15, 2017 at 11:32, Julien CHAMP wrote:

> Hi Spark Community members !
>
> I want to do several (from 1 to 10) aggregate functions using window
> functions on something like 100 columns.
>
> Instead of doing several passes on the data to compute each aggregate
> function, is there a way to do this efficiently?
>
>
>
> Currently it seems that doing
>
>
> val tw =
>   Window
> .orderBy("date")
> .partitionBy("id")
> .rangeBetween(-803520L, 0)
>
> and then
>
> x
>.withColumn("agg1", max("col").over(tw))
>.withColumn("agg2", min("col").over(tw))
>.withColumn("aggX", avg("col").over(tw))
>
>
> is not really efficient :/
> It seems that it iterates over the whole column for each aggregation? Am I
> right?
>
> Is there a way to compute all the required operations on a column with a
> single pass?
> Even better, to compute all the required operations on ALL columns with a
> single pass?
>
> Thx for your Future[Answers]
>
> Julien
--
Julien CHAMP — Data Scientist
Web: www.tellmeplus.com — Email: jch...@tellmeplus.com
Phone: 06 89 35 01 89
TellMePlus S.A — Predictive Objects
Paris: 7 rue des Pommerots, 78400 Chatou
Montpellier: 51 impasse des églantiers, 34980 St Clément de Rivière


Re: kinesis throughput problems

2017-12-15 Thread Gourav Sengupta
Hi Jeremy,

just out of curiosity - you do know that this is a SPARK user group?


Regards,
Gourav

On Thu, Dec 14, 2017 at 7:03 PM, Jeremy Kelley wrote:

> We have a largish Kinesis stream with about 25k events per second, and
> each record is around 142k.  I have tried multiple cluster sizes, multiple
> batch sizes, multiple parameters...  I am doing minimal transformations on
> the data.  Whatever happens, I can sustain consuming 25k with minimal effort
> and cluster load for about 5-10 minutes, and then the stream always shapes
> down and hovers around 5k EPS.
>
> I can give MANY more details but I was curious if anyone had seen similar
> behavior.
>
> Thanks,
> Jeremy
>
>
> --
> Jeremy Kelley | Technical Director, Data
> jkel...@carbonblack.com | Carbon Black Threat Engineering
>
>
>


Using UDF compiled with Janino in Spark

2017-12-15 Thread Michael Shtelma
Hi all,

I am trying to compile my UDF with the Janino compiler and then register
it in Spark and use it afterwards. Here is the code:

String s =
    "public class MyUDF implements org.apache.spark.sql.api.java.UDF1<String, String> {\n" +
    "    @Override\n" +
    "    public String call(String s) throws Exception {\n" +
    "        return s + \"dd\";\n" +
    "    }\n" +
    "}";

ISimpleCompiler sc =
    CompilerFactoryFactory.getDefaultCompilerFactory().newSimpleCompiler();
sc.cook(s);
UDF1 udf1 = (UDF1) sc.getClassLoader().loadClass("MyUDF").newInstance();

sparkSession.udf().register("MyUDF", udf1, DataTypes.StringType);

sparkSession.sql("select MyUDF(id) from deal").show();

The problem is that during execution I am getting the following exception:

15.12.2017 14:32:46 [ERROR]: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy
to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq
in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15.12.2017 14:32:46 [ WARN]: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver):
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy
to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq
in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Any ideas what could go wrong?

Best,
Michael
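
One way to narrow this down (a sketch, not a diagnosis; shown in Scala for brevity, the
Java registration is analogous): register an equivalent, normally compiled UDF. If that
runs, the ClassCastException during task deserialization likely points to a
class-visibility problem with the dynamically generated class (i.e. which class loader
Spark sees it in) rather than to the UDF logic itself.

// Equivalent UDF without runtime compilation, to isolate the Janino part.
sparkSession.udf.register("MyUDF", (s: String) => s + "dd")
sparkSession.sql("select MyUDF(id) from deal").show()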

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Several Aggregations on a window function

2017-12-15 Thread Julien CHAMP
Hi Spark Community members !

I want to do several (from 1 to 10) aggregate functions using window
functions on something like 100 columns.

Instead of doing several passes on the data to compute each aggregate
function, is there a way to do this efficiently?



Currently it seems that doing


val tw =
  Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-803520L, 0)

and then

x
   .withColumn("agg1", max("col").over(tw))
   .withColumn("agg2", min("col").over(tw))
   .withColumn("aggX", avg("col").over(tw))


is not really efficient :/
It seems that it iterates over the whole column for each aggregation? Am I
right?

Is there a way to compute all the required operations on a column with a
single pass?
Even better, to compute all the required operations on ALL columns with a
single pass?
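
A hedged note: window expressions that share the exact same WindowSpec are generally
planned into a single Window operator, so it is worth expressing all the aggregates in
one select over that spec and checking the physical plan with explain(). A sketch:

import org.apache.spark.sql.functions.{avg, col, max, min}

// All aggregates reuse the same WindowSpec `tw`, so they should be evaluated by a
// single Window operator (verify with out.explain()).
val out = x.select(
  col("*"),
  max("col").over(tw).as("agg1"),
  min("col").over(tw).as("agg2"),
  avg("col").over(tw).as("aggX")
)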

Thx for your Future[Answers]

Julien





--
Julien CHAMP — Data Scientist
Web: www.tellmeplus.com — Email: jch...@tellmeplus.com
Phone: 06 89 35 01 89
TellMePlus S.A — Predictive Objects
Paris: 7 rue des Pommerots, 78400 Chatou
Montpellier: 51 impasse des églantiers, 34980 St Clément de Rivière



Recompute Spark outputs intelligently

2017-12-15 Thread Ashwin Raju
Hi,

We have a batch processing application that reads log files over multiple
days, does transformations and aggregations on them using Spark and saves
various intermediate outputs to Parquet. These jobs take many hours to run.
This pipeline is deployed at many customer sites with some site specific
variations.

When we want to make changes to this data pipeline, we delete all the
intermediate output and recompute from the point of change. On some sites,
we hand write a series of "migration" transformations so we do not have to
spend hours recomputing. The reason for changes might be bugs we have found
in our data transformations or new features added to the pipeline.

As you can probably tell, maintaining all these versions and figuring out
what migrations to perform is a headache. What would be ideal is when we
apply an updated pipeline, we can automatically figure out which columns
need to be recomputed and which can be left as is.

Is there a best practice in the Spark ecosystem for this problem? Perhaps
some metadata system/data lineage system we can use? I'm curious if this is
a common problem that has already been addressed.

Thanks,
Ashwin