Re: NASA CDF files in Spark

2017-12-26 Thread Renato Marroquín Mogrovejo
There is also this project

https://github.com/SciSpark/SciSpark

It might be of interest to you Christopher.

2017-12-16 3:46 GMT-05:00 Jörn Franke :

> Develop your own Hadoop FileInputFormat and use
> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)
> to load it. As an alternative, the Spark data source API will also be
> relevant for you in the upcoming version 2.
>
> On 16. Dec 2017, at 03:33, Christopher Piggott  wrote:
>
> I'm looking to run a job that involves a zillion files in a format called
> CDF, a NASA standard.  There are a number of libraries out there that can
> read CDFs, but most of them are not high quality compared to the official
> NASA one, which has Java bindings (via JNI).  It's a little clumsy, but I
> have it working fairly well in Scala.
>
> The way I was planning on distributing work was with
> SparkContext.binaryFiles("hdfs://somepath/*"), but that really gives me
> an RDD of byte[], and unfortunately the CDF library doesn't support any kind
> of array or stream as input.  The reason is that CDF really wants a
> random-access file, for performance reasons.
>
> What's worse, all this code is implemented down at the native layer, in C.
>
> I think my best choice here is to distribute the job using .binaryFiles()
> but then have the first task of the worker be to write all those bytes to a
> ramdisk file (or maybe a real file, we'll see)... then have the CDF library
> open it as if it were a local file.  This seems clumsy and awful but I
> haven't come up with any other good ideas.
>
> Has anybody else worked with these files and have a better idea?  Some
> info on the library that parses all this:
>
> https://cdf.gsfc.nasa.gov/html/cdf_docs.html
>
>
> --Chris
>
>
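
For reference, a minimal sketch of the workaround Chris describes: spill each file's bytes to a local (or ramdisk) path on the executor and hand that path to the native reader. The openCdf call is a placeholder for whatever open call the NASA JNI bindings actually expose; this is an untested illustration, not a verified use of that library.

import java.nio.file.{Files, StandardCopyOption}
import org.apache.spark.{SparkConf, SparkContext}

object CdfWorkaround {
  // Stand-in for the real CDF open call exposed by the JNI bindings.
  def openCdf(localPath: String): Unit = ???

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cdf-workaround"))

    val results = sc.binaryFiles("hdfs://somepath/*").map { case (name, stream) =>
      // Spill the bytes to a local random-access file (a path under /dev/shm would act as a ramdisk).
      val tmp = Files.createTempFile("cdf-", ".cdf")
      Files.copy(stream.open(), tmp, StandardCopyOption.REPLACE_EXISTING)
      try {
        openCdf(tmp.toString)   // the native library reads the local file
        (name, "parsed")        // return whatever gets extracted from the CDF
      } finally {
        Files.deleteIfExists(tmp)
      }
    }

    println(results.count())
    sc.stop()
  }
}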


Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Renato Marroquín Mogrovejo
Hi Akhilesh,

Thanks for your response.
I am using Spark 1.6.1, and what I am trying to do is ingest parquet
files into Spark Streaming, not read them in batch operations.

val ssc = new StreamingContext(sc, Seconds(5))
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class",
  "parquet.avro.AvroReadSupport")

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val oDStream = ssc.fileStream[Void, Order, ParquetInputFormat]("TempData/origin/")

oDStream.foreachRDD(relation => {
  if (relation.count() == 0)
    println("Nothing received")
  else {
    val rDF = relation.toDF().as[Order]
    println(rDF.first())
  }
})

But that doesn't work. Any ideas?


Best,

Renato M.

2016-08-27 9:01 GMT+02:00 Akhilesh Pathodia <pathodia.akhil...@gmail.com>:

> Hi Renato,
>
> Which version of Spark are you using?
>
> If the Spark version is 1.3.0 or later, you can use SQLContext to read the
> parquet file, which will give you a DataFrame. Please follow the link below:
>
> https://spark.apache.org/docs/1.5.0/sql-programming-guide.html#loading-data-programmatically
>
> Thanks,
> Akhilesh
>
> On Sat, Aug 27, 2016 at 3:26 AM, Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com> wrote:
>
>> Anybody? I think Rory also didn't get an answer from the list ...
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3CCAC+fRE14PV5nvQHTBVqDC+6DkXo73oDAzfqsLbSo8F94ozO5nQ@mail.gmail.com%3E
>>
>>
>>
>> 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com>:
>>
>>> Hi all,
>>>
>>> I am trying to use parquet files as input for DStream operations, but I
>>> can't find any documentation or example. The only thing I found was [1] but
>>> I also get the same error as in the post (Class
>>> parquet.avro.AvroReadSupport not found).
>>> Ideally I would like to have something like this:
>>>
>>> val oDStream = ssc.fileStream[Void, Order, ParquetInputFormat[Order]]("data/")
>>>
>>> where Order is a case class and the files inside "data" are all parquet
>>> files.
>>> Any hints would be highly appreciated. Thanks!
>>>
>>>
>>> Best,
>>>
>>> Renato M.
>>>
>>> [1] http://stackoverflow.com/questions/35413552/how-do-i-read-in-parquet-files-using-ssc-filestream-and-what-is-the-nature
>>>
>>
>>
>
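
As a point of comparison, the batch-oriented route Akhilesh points to looks roughly like the sketch below on Spark 1.6. The Order case class and the path are assumptions taken from the thread; this is not a fix for the fileStream issue, just the non-streaming alternative.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed schema; the real Order class comes from the Avro/parquet definition.
case class Order(id: Long, amount: Double)

object ParquetBatchRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-batch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Batch read: a DataFrame, optionally viewed as a typed Dataset of Order.
    val orders = sqlContext.read.parquet("TempData/origin/").as[Order]
    orders.collect().foreach(println)

    sc.stop()
  }
}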


Re: Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Anybody? I think Rory also didn't get an answer from the list ...

https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E



2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com>:

> Hi all,
>
> I am trying to use parquet files as input for DStream operations, but I
> can't find any documentation or example. The only thing I found was [1] but
> I also get the same error as in the post (Class
> parquet.avro.AvroReadSupport not found).
> Ideally I would like to have something like this:
>
> val oDStream = ssc.fileStream[Void, Order, ParquetInputFormat[Order]]("data/")
>
> where Order is a case class and the files inside "data" are all parquet
> files.
> Any hints would be highly appreciated. Thanks!
>
>
> Best,
>
> Renato M.
>
> [1] http://stackoverflow.com/questions/35413552/how-do-i-read-in-parquet-files-using-ssc-filestream-and-what-is-the-nature
>


Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi all,

I am trying to use parquet files as input for DStream operations, but I
can't find any documentation or example. The only thing I found was [1] but
I also get the same error as in the post (Class
parquet.avro.AvroReadSupport not found).
Ideally I would like to have something like this:

val oDStream = ssc.fileStream[Void, Order, ParquetInputFormat[Order]]("data/")

where Order is a case class and the files inside "data" are all parquet
files.
Any hints would be highly appreciated. Thanks!


Best,

Renato M.

[1]
http://stackoverflow.com/questions/35413552/how-do-i-read-in-parquet-files-using-ssc-filestream-and-what-is-the-nature


Re: mutable.LinkedHashMap kryo serialization issues

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi Rahul,

You have probably already figured this one out, but anyway...
You need to register the classes you'll be using with Kryo in advance: it does
not support all Serializable types out of the box, so when a class isn't
registered, Kryo may not know how to serialize/deserialize it correctly.


Best,

Renato M.
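
A minimal sketch along those lines, plus one possible workaround for the empty-after-collect symptom (the workaround is an assumption on my part, not a confirmed fix): ship the data as an immutable sequence of pairs and rebuild the mutable.LinkedHashMap on the driver, so Kryo never has to round-trip the mutable class.

import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

object KryoLinkedHashMap {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("kryo-linkedhashmap")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[mutable.LinkedHashMap[String, String]]))
    val sc = new SparkContext(conf)

    val maps = sc.parallelize(0 to 10).map { _ =>
      new mutable.LinkedHashMap[String, String]() ++= Array("hello" -> "bonjour", "good" -> "bueno")
    }

    // Ship an immutable, ordered Seq of pairs instead of the mutable map itself ...
    val collected = maps.map(_.toSeq).collect()
    // ... and rebuild the LinkedHashMap driver-side, preserving insertion order.
    val rebuilt = collected.map(pairs => mutable.LinkedHashMap(pairs: _*))
    println("driver-side size: " + rebuilt.head.size)   // expected: 2

    sc.stop()
  }
}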

2016-08-22 17:12 GMT+02:00 Rahul Palamuttam :

> Hi,
>
> Just sending this again to see if others have had this issue.
>
> I recently switched to using kryo serialization and I've been running into
> errors
> with the mutable.LinkedHashMap class.
>
> If I don't register the mutable.LinkedHashMap class then I get an
> ArrayStoreException seen below.
> If I do register the class, then when the LinkedHashMap is collected on
> the driver, it does not contain any elements.
>
> Here is the snippet of code I used :
>
> val sc = new SparkContext(new SparkConf()
>   .setMaster("local[*]")
>   .setAppName("Sample")
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .registerKryoClasses(Array(classOf[mutable.LinkedHashMap[String, String]])))
>
> val collect = sc.parallelize(0 to 10)
>   .map(p => new mutable.LinkedHashMap[String, String]() ++= Array(("hello", "bonjour"), ("good", "bueno")))
>
> val mapSideSizes = collect.map(p => p.size).collect()(0)
> val driverSideSizes = collect.collect()(0).size
>
> println("The sizes before collect : " + mapSideSizes)
> println("The sizes after collect : " + driverSideSizes)
>
>
> ** The following only occurs if I did not register the
> mutable.LinkedHashMap class **
> 16/08/20 18:10:38 ERROR TaskResultGetter: Exception while getting task result
> java.lang.ArrayStoreException: scala.collection.mutable.HashMap
> at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
> at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:311)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:97)
> at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:60)
> at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
> at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
> at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> I hope this is a known issue and/or I'm missing something important in my
> setup.
> Appreciate any help or advice!
>
> Best,
>
> Rahul Palamuttam
>


Re: Spark for offline log processing/querying

2016-05-23 Thread Renato Marroquín Mogrovejo
We also did some benchmarking using analytical queries similar to TPC-H,
both with Spark and Presto, and our conclusion was that Spark is a great
general-purpose solution, but for analytical SQL queries it is not quite
there yet. I mean, for 10 or 100 GB of data you will get your results back,
but Presto was way faster and more predictable. Of course, if you are
planning to do machine learning or ad-hoc data processing, then Spark is the
right solution.


Renato M.

2016-05-23 9:38 GMT+02:00 Mat Schaffer :

> It's only really mildly interactive. When I used presto+hive in the past
> (just a consumer not an admin) it seemed to be able to provide answers
> within ~2m even for fairly large data sets. Hoping I can get a similar
> level of responsiveness with spark.
>
> Thanks, Sonal! I'll take a look at the example log processor and see what
> I can come up with.
>
>
> -Mat
>
> matschaffer.com
>
> On Mon, May 23, 2016 at 3:08 PM, Jörn Franke  wrote:
>
>> Do you want to replace ELK by Spark? Depending on your queries you could
>> do as you proposed. However, many of the text analytics queries will
>> probably be much faster on ELK. If your queries are more interactive and
>> not about batch processing then it does not make so much sense. I am not
>> sure why you plan to use Presto.
>>
>> On 23 May 2016, at 07:28, Mat Schaffer  wrote:
>>
>> I'm curious about trying to use spark as a cheap/slow ELK
>> (ElasticSearch,Logstash,Kibana) system. Thinking something like:
>>
>> - instances rotate local logs
>> - copy rotated logs to s3
>> (s3://logs/region/grouping/instance/service/*.logs)
>> - spark to convert from raw text logs to parquet
>> - maybe presto to query the parquet?
>>
>> I'm still new on Spark though, so thought I'd ask if anyone was familiar
>> with this sort of thing and if there are maybe some articles or documents I
>> should be looking at in order to learn how to build such a thing. Or if
>> such a thing even made sense.
>>
>> Thanks in advance, and apologies if this has already been asked and I
>> missed it!
>>
>> -Mat
>>
>> matschaffer.com
>>
>>
>
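
For the "spark to convert from raw text logs to parquet" step Mat outlines, a minimal sketch could look like the following (Spark 1.6-era API; the s3a paths, the log layout, and the naive whitespace parser are illustrative assumptions, not the actual setup).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed log record shape; real logs would need a proper schema and parser.
case class LogLine(ts: String, level: String, message: String)

object LogsToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logs-to-parquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val raw = sc.textFile("s3a://logs/region/grouping/*/*/*.logs")
    val parsed = raw.flatMap { line =>
      // naive "timestamp level message" split; a real pipeline would use a regex
      line.split(" ", 3) match {
        case Array(ts, level, msg) => Some(LogLine(ts, level, msg))
        case _                     => None
      }
    }

    // Write parquet partitioned so a downstream engine (Presto, SparkSQL) can prune.
    parsed.toDF().write.partitionBy("level").parquet("s3a://logs-parquet/")
    sc.stop()
  }
}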


Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-12 Thread Renato Marroquín Mogrovejo
Hi Amit,

This is very interesting indeed, because I have got similar results. I tried
doing a filter + groupBy using a Dataset with a function, and using the inner
RDD of the DataFrame (RDD[Row]). I used the inner RDD of a DataFrame because
apparently there is no straightforward way to create an RDD of Parquet data
without creating a SQLContext. If anybody has some code to share with me,
please share (:
I used 1GB of parquet data, and when doing the operations with the RDD it was
much faster. After looking at the execution plans, it is clear why Datasets
do worse: to use them, an extra map operation is performed to turn Row
objects into the defined case class, and then the Dataset goes through the
whole query optimization platform (Catalyst, plus moving objects in and out
of Tungsten).
Thus, I think for operations that are too "simple", it is more expensive to
use the entire DS/DF infrastructure than the inner RDD.
IMHO if you have complex SQL queries, it makes sense to use DS/DF, but if
you don't, then using RDDs directly is probably still faster.


Renato M.
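
To make the comparison concrete, the two code paths being contrasted look roughly like this (Spark 1.6 APIs; the Item case class, the parquet path, and the predicate are placeholders). The typed Dataset route pays for the Row-to-case-class map plus Catalyst/Tungsten, while the plain RDD[Row] route skips the conversion.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Illustrative schema; the real columns depend on the parquet data.
case class Item(key: String, value: Double)

object DatasetVsRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ds-vs-rdd").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sqlContext.read.parquet("data/items.parquet")

    // Variant 1: typed Dataset -- Rows are mapped into Item, then Catalyst/Tungsten kick in.
    val dsCounts = df.as[Item].filter(_.value > 5.0).groupBy(_.key).count()

    // Variant 2: the underlying RDD[Row] -- no conversion into the case class.
    val rddCounts = df.rdd
      .filter(_.getDouble(1) > 5.0)
      .map(row => (row.getString(0), 1L))
      .reduceByKey(_ + _)

    // Both are lazy until an action runs.
    println(dsCounts.count())
    println(rddCounts.count())
    sc.stop()
  }
}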

2016-05-11 20:17 GMT+02:00 Amit Sela :

> Somehow missed that ;)
> Anything about Datasets slowness ?
>
> On Wed, May 11, 2016, 21:02 Ted Yu  wrote:
>
>> Which release are you using ?
>>
>> You can use the following to disable UI:
>> --conf spark.ui.enabled=false
>>
>> On Wed, May 11, 2016 at 10:59 AM, Amit Sela  wrote:
>>
>>> I've run a simple WordCount example with a very small List as input
>>> lines, ran it in standalone mode (local[*]), and Datasets are very slow.
>>> We're talking ~700 msec for RDDs while Datasets take ~3.5 sec.
>>> Is this just start-up overhead ? please note that I'm not timing the
>>> context creation...
>>>
>>> And in general, is there a way to run with local[*] "lightweight" mode
>>> for testing ? something like without the WebUI server for example (and
>>> anything else that's not needed for testing purposes)
>>>
>>> Thanks,
>>> Amit
>>>
>>
>>


Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Hi Bernardo,

So is this in distributed mode or on a single node? Maybe fix the issue on a
single node first ;)
You are right that Spark seems to find the library, but maybe not the *.so
file itself. I also use System.load() with LD_LIBRARY_PATH set, and I am able
to execute without issues. Maybe you'd like to double-check the paths, the
env variables, or the parameters spark.driver.extraLibraryPath and
spark.executor.extraLibraryPath.


Best,

Renato M.
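
For completeness, a sketch of the knobs mentioned above. The library name and paths are placeholders, and in practice the two extraLibraryPath properties are usually passed via spark-submit --conf or spark-defaults.conf rather than set in code; this is illustration, not the actual setup from this thread.

import org.apache.spark.{SparkConf, SparkContext}

object NativeLibJob {
  // Evaluated once per JVM the first time it is touched on an executor.
  lazy val nativeLoaded: Boolean = { System.loadLibrary("mynative"); true }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("native-lib-job")
      .set("spark.driver.extraLibraryPath", "/opt/native")     // placeholder dir holding the .so
      .set("spark.executor.extraLibraryPath", "/opt/native")
    val sc = new SparkContext(conf)

    val out = sc.parallelize(1 to 100).mapPartitions { it =>
      NativeLibJob.nativeLoaded   // force the library load inside this executor JVM
      it.map(x => x * 2)          // stand-in for the actual native call
    }
    println(out.sum())
    sc.stop()
  }
}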

2015-10-14 21:40 GMT+02:00 Bernardo Vecchia Stein <bernardovst...@gmail.com>
:

> Hi Renato,
>
> I have done that as well, but so far no luck. I believe spark is finding
> the library correctly, otherwise the error message would be "no libraryname
> found" or something like that. The problem seems to be something else, and
> I'm not sure how to find it.
>
> Thanks,
> Bernardo
>
> On 14 October 2015 at 16:28, Renato Marroquín Mogrovejo <
> renatoj.marroq...@gmail.com> wrote:
>
>> You can also try setting the env variable LD_LIBRARY_PATH to point where
>> your compiled libraries are.
>>
>>
>> Renato M.
>>
>> 2015-10-14 21:07 GMT+02:00 Bernardo Vecchia Stein <
>> bernardovst...@gmail.com>:
>>
>>> Hi Deenar,
>>>
>>> Yes, the native library is installed on all machines of the cluster. I
>>> tried a simpler approach by just using System.load() and passing the exact
>>> path of the library, and things still won't work (I get exactly the same
>>> error and message).
>>>
>>> Any ideas of what might be failing?
>>>
>>> Thank you,
>>> Bernardo
>>>
>>> On 14 October 2015 at 02:50, Deenar Toraskar <deenar.toras...@gmail.com>
>>> wrote:
>>>
>>>> Hi Bernardo
>>>>
>>>> Is the native library installed on all machines of your cluster and are
>>>> you setting both the spark.driver.extraLibraryPath and
>>>> spark.executor.extraLibraryPath ?
>>>>
>>>> Deenar
>>>>
>>>>
>>>>
>>>> On 14 October 2015 at 05:44, Bernardo Vecchia Stein <
>>>> bernardovst...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I am trying to run some scala code in cluster mode using spark-submit.
>>>>> This code uses addLibrary to link with a .so that exists in the machine,
>>>>> and this library has a function to be called natively (there's a native
>>>>> definition as needed in the code).
>>>>>
>>>>> The problem I'm facing is: whenever I try to run this code in cluster
>>>>> mode, spark fails with the following message when trying to execute the
>>>>> native function:
>>>>> java.lang.UnsatisfiedLinkError:
>>>>> org.name.othername.ClassName.nativeMethod([B[B)[B
>>>>>
>>>>> Apparently, the library is being found by spark, but the required
>>>>> function isn't found.
>>>>>
>>>>> When trying to run in client mode, however, this doesn't fail and
>>>>> everything works as expected.
>>>>>
>>>>> Does anybody have any idea of what might be the problem here? Is there
>>>>> any bug that could be related to this when running in cluster mode?
>>>>>
>>>>> I appreciate any help.
>>>>> Thanks,
>>>>> Bernardo
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Sorry Bernardo, I just double checked. I use: System.loadLibrary();
Could you also try that?


Renato M.

2015-10-14 21:51 GMT+02:00 Renato Marroquín Mogrovejo <
renatoj.marroq...@gmail.com>:

> Hi Bernardo,
>
> So is this in distributed mode or on a single node? Maybe fix the issue on a
> single node first ;)
> You are right that Spark seems to find the library, but maybe not the *.so
> file itself. I also use System.load() with LD_LIBRARY_PATH set, and I am able
> to execute without issues. Maybe you'd like to double-check the paths, the
> env variables, or the parameters spark.driver.extraLibraryPath and
> spark.executor.extraLibraryPath.
>
>
> Best,
>
> Renato M.
>
> 2015-10-14 21:40 GMT+02:00 Bernardo Vecchia Stein <
> bernardovst...@gmail.com>:
>
>> Hi Renato,
>>
>> I have done that as well, but so far no luck. I believe spark is finding
>> the library correctly, otherwise the error message would be "no libraryname
>> found" or something like that. The problem seems to be something else, and
>> I'm not sure how to find it.
>>
>> Thanks,
>> Bernardo
>>
>> On 14 October 2015 at 16:28, Renato Marroquín Mogrovejo <
>> renatoj.marroq...@gmail.com> wrote:
>>
>>> You can also try setting the env variable LD_LIBRARY_PATH to point where
>>> your compiled libraries are.
>>>
>>>
>>> Renato M.
>>>
>>> 2015-10-14 21:07 GMT+02:00 Bernardo Vecchia Stein <
>>> bernardovst...@gmail.com>:
>>>
>>>> Hi Deenar,
>>>>
>>>> Yes, the native library is installed on all machines of the cluster. I
>>>> tried a simpler approach by just using System.load() and passing the exact
>>>> path of the library, and things still won't work (I get exactly the same
>>>> error and message).
>>>>
>>>> Any ideas of what might be failing?
>>>>
>>>> Thank you,
>>>> Bernardo
>>>>
>>>> On 14 October 2015 at 02:50, Deenar Toraskar <deenar.toras...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Bernardo
>>>>>
>>>>> Is the native library installed on all machines of your cluster and
>>>>> are you setting both the spark.driver.extraLibraryPath and
>>>>> spark.executor.extraLibraryPath ?
>>>>>
>>>>> Deenar
>>>>>
>>>>>
>>>>>
>>>>> On 14 October 2015 at 05:44, Bernardo Vecchia Stein <
>>>>> bernardovst...@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I am trying to run some scala code in cluster mode using
>>>>>> spark-submit. This code uses addLibrary to link with a .so that exists in
>>>>>> the machine, and this library has a function to be called natively 
>>>>>> (there's
>>>>>> a native definition as needed in the code).
>>>>>>
>>>>>> The problem I'm facing is: whenever I try to run this code in cluster
>>>>>> mode, spark fails with the following message when trying to execute the
>>>>>> native function:
>>>>>> java.lang.UnsatisfiedLinkError:
>>>>>> org.name.othername.ClassName.nativeMethod([B[B)[B
>>>>>>
>>>>>> Apparently, the library is being found by spark, but the required
>>>>>> function isn't found.
>>>>>>
>>>>>> When trying to run in client mode, however, this doesn't fail and
>>>>>> everything works as expected.
>>>>>>
>>>>>> Does anybody have any idea of what might be the problem here? Is
>>>>>> there any bug that could be related to this when running in cluster mode?
>>>>>>
>>>>>> I appreciate any help.
>>>>>> Thanks,
>>>>>> Bernardo
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
You can also try setting the env variable LD_LIBRARY_PATH to point where
your compiled libraries are.


Renato M.

2015-10-14 21:07 GMT+02:00 Bernardo Vecchia Stein 
:

> Hi Deenar,
>
> Yes, the native library is installed on all machines of the cluster. I
> tried a simpler approach by just using System.load() and passing the exact
> path of the library, and things still won't work (I get exactly the same
> error and message).
>
> Any ideas of what might be failing?
>
> Thank you,
> Bernardo
>
> On 14 October 2015 at 02:50, Deenar Toraskar 
> wrote:
>
>> Hi Bernardo
>>
>> Is the native library installed on all machines of your cluster and are
>> you setting both the spark.driver.extraLibraryPath and
>> spark.executor.extraLibraryPath ?
>>
>> Deenar
>>
>>
>>
>> On 14 October 2015 at 05:44, Bernardo Vecchia Stein <
>> bernardovst...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I am trying to run some scala code in cluster mode using spark-submit.
>>> This code uses addLibrary to link with a .so that exists in the machine,
>>> and this library has a function to be called natively (there's a native
>>> definition as needed in the code).
>>>
>>> The problem I'm facing is: whenever I try to run this code in cluster
>>> mode, spark fails with the following message when trying to execute the
>>> native function:
>>> java.lang.UnsatisfiedLinkError:
>>> org.name.othername.ClassName.nativeMethod([B[B)[B
>>>
>>> Apparently, the library is being found by spark, but the required
>>> function isn't found.
>>>
>>> When trying to run in client mode, however, this doesn't fail and
>>> everything works as expected.
>>>
>>> Does anybody have any idea of what might be the problem here? Is there
>>> any bug that could be related to this when running in cluster mode?
>>>
>>> I appreciate any help.
>>> Thanks,
>>> Bernardo
>>>
>>
>>
>


Doubts about SparkSQL

2015-05-23 Thread Renato Marroquín Mogrovejo
Hi all,

I have some doubts about the latest SparkSQL.

1. In the paper about SparkSQL it is stated that "The physical planner also
performs rule-based physical optimizations, such as pipelining projections or
filters into one Spark map operation."

If dealing with a query of the form:

select * from (
  select * from tableA where date1 > '19-12-2015'
) A
where attribute1 = 'valueA' and attribute2 = 'valueB'

Could I be sure that both filters are applied sequentially in memory, i.e.
that the date filter is applied first and the attribute filter is then
applied over that result set? Or will two different map-only operations be
spawned?

2. Is the Catalyst query optimizer aware of how the data was partitioned, or
does it not make any assumptions about this?
Thanks in advance!


Renato M.
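
One way to check question 1 empirically (a suggestion, not something from the paper): chain the two filters through the DataFrame API and inspect the physical plan with explain(); if the predicates are pipelined they show up as a single Filter over the scan rather than two separate operators. The table path and column names below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FilterPipelining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-pipelining").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val tableA = sqlContext.read.parquet("data/tableA.parquet")
    val q = tableA
      .filter(tableA("date1") > "19-12-2015")
      .filter(tableA("attribute1") === "valueA" && tableA("attribute2") === "valueB")

    // Prints the logical and physical plans; pipelined filters appear as one
    // combined Filter operator over the scan instead of two.
    q.explain(true)

    sc.stop()
  }
}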


Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks Michael!
I have tried applying my schema programmatically, but I didn't get any
improvement in performance :(
Could you point me to some code examples using Avro, please?
Many thanks again!


Renato M.
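
For what it's worth, a minimal sketch of the programmatic-schema route Michael links to below, working with Rows directly instead of JavaBeans (Spark 1.3-era API; the column names and the toy data are illustrative).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RowsWithSchema {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rows-with-schema").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Schema declared programmatically, no JavaBean involved.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = false),
      StructField("attribute1", IntegerType, nullable = false)))

    val rows = sc.parallelize(Seq(Row("a", 3), Row("b", 7), Row("c", 1)))
    val df = sqlContext.createDataFrame(rows, schema)
    df.registerTempTable("tableX")

    sqlContext.sql("SELECT * FROM tableX WHERE attribute1 > 5").show()
    sc.stop()
  }
}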

2015-04-21 20:45 GMT+02:00 Michael Armbrust mich...@databricks.com:

 Here is an example using rows directly:

 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

 Avro or parquet input would likely give you the best performance.

 On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Thanks for the hints guys! much appreciated!
 Even if I just do something like:

 Select * from tableX where attribute1 > 5

 I see similar behaviour.

 @Michael
 Could you point me to any sample code that uses Spark's Rows? We are at a
 phase where we can actually change our JavaBeans for something that
 provides better performance than what we are seeing now. Would you
 recommend using an Avro representation then?
 Thanks again!


 Renato M.

 2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query Select * from tableX where attribute1 between
 0 and 5 that I run over a Kryo file with four partitions that ends up
 being around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds but when I apply schema, register the table into a 
 SqlContext,
 and then run the query, it takes around ~16 seconds. This is using Spark
 1.2.1 with Scala 2.10.0
 I am wondering why there is such a big gap on performance if it is
 just a filter. Internally, the relation files are mapped to a JavaBean.
 This different data presentation (JavaBeans vs SparkSQL internal
 representation) could lead to such difference? Is there anything I could 
 do
 to make the performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.








Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
Thanks for the hints guys! much appreciated!
Even if I just do something like:

Select * from tableX where attribute1 > 5

I see similar behaviour.

@Michael
Could you point me to any sample code that uses Spark's Rows? We are at a
phase where we can actually change our JavaBeans for something that
provides better performance than what we are seeing now. Would you
recommend using an Avro representation then?
Thanks again!


Renato M.

2015-04-21 1:18 GMT+02:00 Michael Armbrust mich...@databricks.com:

 There is a cost to converting from JavaBeans to Rows and this code path
 has not been optimized.  That is likely what you are seeing.

 On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote:

 SparkSQL optimizes better by column pruning and predicate pushdown,
 primarily. Here you are not taking advantage of either.

 I am curious to know what goes in your filter function, as you are not
 using a filter in SQL side.

 Best
 Ayan
 On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com wrote:

 Does anybody have an idea? a clue? a hint?
 Thanks!


 Renato M.

 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
 renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query Select * from tableX where attribute1 between 0
 and 5 that I run over a Kryo file with four partitions that ends up being
 around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around
 ~9.6 seconds but when I apply schema, register the table into a SqlContext,
 and then run the query, it takes around ~16 seconds. This is using Spark
 1.2.1 with Scala 2.10.0
 I am wondering why there is such a big gap on performance if it is just
 a filter. Internally, the relation files are mapped to a JavaBean. This
 different data presentation (JavaBeans vs SparkSQL internal representation)
 could lead to such difference? Is there anything I could do to make the
 performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.






Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Does anybody have an idea? a clue? a hint?
Thanks!


Renato M.

2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com:

 Hi all,

 I have a simple query Select * from tableX where attribute1 between 0 and
 5 that I run over a Kryo file with four partitions that ends up being
 around 3.5 million rows in our case.
 If I run this query by doing a simple map().filter() it takes around ~9.6
 seconds but when I apply schema, register the table into a SqlContext, and
 then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
 with Scala 2.10.0
 I am wondering why there is such a big gap on performance if it is just a
 filter. Internally, the relation files are mapped to a JavaBean. This
 different data presentation (JavaBeans vs SparkSQL internal representation)
 could lead to such difference? Is there anything I could do to make the
 performance get closer to the hard-coded option?
 Thanks in advance for any suggestions or ideas.


 Renato M.



SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Hi all,

I have a simple query, "Select * from tableX where attribute1 between 0 and
5", that I run over a Kryo file with four partitions that ends up being
around 3.5 million rows in our case.
If I run this query by doing a simple map().filter() it takes around ~9.6
seconds, but when I apply a schema, register the table in a SQLContext, and
then run the query, it takes around ~16 seconds. This is using Spark 1.2.1
with Scala 2.10.0.
I am wondering why there is such a big gap in performance if it is just a
filter. Internally, the relation files are mapped to a JavaBean. Could this
different data representation (JavaBeans vs SparkSQL's internal
representation) lead to such a difference? Is there anything I could do to
make the performance get closer to the hard-coded option?
Thanks in advance for any suggestions or ideas.


Renato M.


Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
Hi all,

I am trying to understand how Spark's lazy evaluation works, and I need some
help. I have noticed that creating an RDD once and using it many times
won't trigger recomputation of it every time it gets used, whereas creating
a new RDD each time an operation is performed will trigger recomputation
of the whole RDD again.
I would have thought that both approaches behave similarly (i.e. not
caching) due to Spark's lazy evaluation strategy, but I guess Spark
keeps track of the RDDs used and of the partial results computed so far, so
it doesn't do unnecessary extra work. Could anybody point me to where Spark
decides what to cache, or how I can disable this behaviour?
Thanks in advance!


Renato M.

Approach 1 -- this doesn't trigger recomputation of the RDD in every
iteration
=
JavaRDD aggrRel = Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
for (int i = 0; i < NUM_RUNS; i++) {
   // doing some computation like aggrRel.count()
   . . .
}

Approach 2 -- this triggers recomputation of the RDD in every iteration
=
for (int i = 0; i < NUM_RUNS; i++) {
   JavaRDD aggrRel = Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
   // doing some computation like aggrRel.count()
   . . .
}
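
If the goal is to control reuse explicitly rather than rely on any implicit reuse of shuffle output, the usual knobs are cache()/persist() and unpersist(). A small sketch in the spirit of Approach 1 (Scala here; the load, group, and map functions are placeholders for readJavaRDD, groupFunction, and mapFunction).

import org.apache.spark.{SparkConf, SparkContext}

object ExplicitCaching {
  val NUM_RUNS = 5

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explicit-caching").setMaster("local[*]"))

    val aggrRel = sc.textFile("data/input.txt")        // stand-in for Utils.readJavaRDD(...)
      .groupBy(line => line.take(1))                   // stand-in for groupFunction
      .map { case (k, vs) => (k, vs.size) }            // stand-in for mapFunction
      .cache()                                         // pin the result in memory explicitly

    for (i <- 0 until NUM_RUNS) {
      println("run " + i + ": " + aggrRel.count())     // reuses the cached partitions
    }

    aggrRel.unpersist()                                // drop it when the experiment is done
    sc.stop()
  }
}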


Re: Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
Thanks Sean!
Do you know if there is a way (even manually) to delete these intermediate
shuffle results? I just want to test the expected behaviour. I know that
this implicit caching is probably beneficial most of the time, but I want to
try it without it.


Renato M.

2015-03-30 12:15 GMT+02:00 Sean Owen so...@cloudera.com:

 I think that you get a sort of silent caching after shuffles, in
 some cases, since the shuffle files are not immediately removed and
 can be reused.

 (This is the flip side to the frequent question/complaint that the
 shuffle files aren't removed straight away.)

 On Mon, Mar 30, 2015 at 9:43 AM, Renato Marroquín Mogrovejo
 renatoj.marroq...@gmail.com wrote:
  Hi all,
 
  I am trying to understand how Spark's lazy evaluation works, and I need some
 help.
  I have noticed that creating an RDD once and using it many times won't
  trigger recomputation of it every time it gets used. Whereas creating a
 new
  RDD for every time a new operation is performed will trigger
 recomputation
  of the whole RDD again.
  I would have thought that both approaches behave similarly (i.e. not
  caching) due to Spark's lazy evaluation strategy, but I guess Spark
 keeps
  track of the RDD used and of the partial results computed so far so it
  doesn't do unnecessary extra work. Could anybody point me to where Spark
  decides what to cache or how I can disable this behaviour?
  Thanks in advance!
 
 
  Renato M.
 
  Approach 1 -- this doesn't trigger recomputation of the RDD in every
  iteration
  =
  JavaRDD aggrRel =
  Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
  for (int i = 0; i < NUM_RUNS; i++) {
 // doing some computation like aggrRel.count()
 . . .
  }
 
  Approach 2 -- this triggers recomputation of the RDD in every iteration
  =
  for (int i = 0; i < NUM_RUNS; i++) {
 JavaRDD aggrRel =
  Utils.readJavaRDD(...).groupBy(groupFunction).map(mapFunction);
 // doing some computation like aggrRel.count()
 . . .
  }



[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Renato Marroquín Mogrovejo
Hi Spark experts,

Is there a way to convert a JavaSchemaRDD (for instance, one loaded from a
parquet file) back to a JavaRDD of a given case class? I read on
StackOverflow [1] that I could do a select over the parquet file and then get
the fields out by reflection, but I guess that would be overkill.
Then I saw [2], from 2014, which says that this feature would be available in
the future. So could you please let me know how I can accomplish this?
Thanks in advance!


Renato M.

[1]
http://stackoverflow.com/questions/26181353/how-to-convert-spark-schemardd-into-rdd-of-my-case-class
[2]
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html
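
A minimal sketch of the row-mapping approach from [1], in Scala for brevity (Spark 1.2-era API; the Person case class, the path, and the column order are assumptions about the parquet schema, and the same map applies to a JavaSchemaRDD, which is likewise an RDD of Rows).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumed target case class; field order must match the parquet columns below.
case class Person(name: String, age: Int)

object SchemaRddToCaseClass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schemardd-to-case-class").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    val schemaRdd = sqlContext.parquetFile("data/people.parquet")   // SchemaRDD in Spark 1.2

    // A SchemaRDD is an RDD[Row], so a plain map does the conversion by position.
    val people = schemaRdd.map(row => Person(row.getString(0), row.getInt(1)))
    people.take(5).foreach(println)
    sc.stop()
  }
}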