Re: NASA CDF files in Spark

2017-12-26 Thread Renato Marroquín Mogrovejo
There is also this project: https://github.com/SciSpark/SciSpark. It might be of interest to you, Christopher. 2017-12-16 3:46 GMT-05:00 Jörn Franke : > Develop your own HadoopFileFormat and use https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext. >
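A minimal sketch of that suggestion, with a hypothetical CdfInputFormat you would implement for the CDF layout (the record reader below is only a stub; a real one would parse CDF records, and the HDFS path is illustrative):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}
    import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // Hypothetical InputFormat for CDF files.
    class CdfInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
      override def createRecordReader(split: InputSplit, ctx: TaskAttemptContext): RecordReader[LongWritable, BytesWritable] =
        new RecordReader[LongWritable, BytesWritable] {
          override def initialize(split: InputSplit, ctx: TaskAttemptContext): Unit = ()
          override def nextKeyValue(): Boolean = false   // stub: a real reader iterates CDF records here
          override def getCurrentKey: LongWritable = new LongWritable(0L)
          override def getCurrentValue: BytesWritable = new BytesWritable()
          override def getProgress: Float = 1.0f
          override def close(): Unit = ()
        }
    }

    // Hand the custom format to SparkContext.
    val cdfRecords = sc.newAPIHadoopFile[LongWritable, BytesWritable, CdfInputFormat]("hdfs:///data/cdf/")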

Re: Reading parquet files into Spark Streaming

2016-08-27 Thread Renato Marroquín Mogrovejo
-programmatically > > Thanks, > Akhilesh > > On Sat, Aug 27, 2016 at 3:26 AM, Renato Marroquín Mogrovejo < > renatoj.marroq...@gmail.com> wrote: > >> Anybody? I think Rory also didn't get an answer from the list ... >> >> https://mail-archives.ap

Re: Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Anybody? I think Rory also didn't get an answer from the list ... https://mail-archives.apache.org/mod_mbox/spark-user/201602.mbox/%3ccac+fre14pv5nvqhtbvqdc+6dkxo73odazfqslbso8f94ozo...@mail.gmail.com%3E 2016-08-26 17:42 GMT+02:00 Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.

Reading parquet files into Spark Streaming

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi all, I am trying to use parquet files as input for DStream operations, but I can't find any documentation or examples. The only thing I found was [1], but I also get the same error as in the post (Class parquet.avro.AvroReadSupport not found). Ideally I would like to have something like this:
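A minimal sketch of one fileStream-based approach, assuming the parquet-avro artifact is on the classpath (the missing AvroReadSupport class usually indicates it is not); directory and file-extension names are illustrative:

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.parquet.avro.AvroReadSupport
    import org.apache.parquet.hadoop.ParquetInputFormat
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(30))
    val job = Job.getInstance(ssc.sparkContext.hadoopConfiguration)
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

    // Monitor a directory for new parquet files and read them as Avro GenericRecords.
    val parquetStream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
      "hdfs:///data/incoming", (p: Path) => p.getName.endsWith(".parquet"), newFilesOnly = true, job.getConfiguration)

    parquetStream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()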

Re: mutable.LinkedHashMap kryo serialization issues

2016-08-26 Thread Renato Marroquín Mogrovejo
Hi Rahul, You have probably already figured this one out, but anyway... You need to register the classes you'll be using with Kryo: it does not support all Serializable types out of the box and requires you to register the classes used in the program in advance. So when you don't register the
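A minimal sketch of what that registration looks like (the type parameters on LinkedHashMap are illustrative; register the concrete instantiations you actually serialize):

    import org.apache.spark.SparkConf
    import scala.collection.mutable

    val conf = new SparkConf()
      .setAppName("kryo-registration")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: fail fast when an unregistered class is serialized instead of
      // silently falling back to writing fully qualified class names.
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(classOf[mutable.LinkedHashMap[String, String]]))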

Re: Spark for offline log processing/querying

2016-05-23 Thread Renato Marroquín Mogrovejo
We also did some benchmarking using analytical queries similar to TPC-H, both with Spark and Presto, and our conclusion was that Spark is a great general-purpose solution, but for analytical SQL queries it is still not there yet. I mean, for 10 or 100GB of data you will get your results back, but with Presto

Re: Datasets is extremely slow in comparison to RDD in standalone mode WordCount example

2016-05-12 Thread Renato Marroquín Mogrovejo
Hi Amit, This is very interesting indeed, because I have got similar results. I tried doing a filter + groupBy using a Dataset with a function, and using the inner RDD of the DataFrame (RDD[Row]). I used the inner RDD of a DataFrame because apparently there is no straight-forward way to create an RDD of
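A minimal sketch of the two variants being compared, assuming a Spark 2.x SparkSession (the case class, column names, and input path are illustrative):

    import org.apache.spark.sql.SparkSession

    case class Event(key: String, value: Int)

    val spark = SparkSession.builder().appName("ds-vs-rdd").getOrCreate()
    import spark.implicits._

    val ds = spark.read.parquet("/path/to/events").as[Event]

    // Typed Dataset path: filter + groupBy with plain functions
    val viaDataset = ds.filter(_.value > 0).groupByKey(_.key).count()

    // Same logic on the DataFrame's underlying RDD[Row]
    val viaRdd = ds.toDF().rdd
      .filter(row => row.getAs[Int]("value") > 0)
      .groupBy(row => row.getAs[String]("key"))
      .mapValues(_.size)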

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
park is finding > the library correctly, otherwise the error message would be "no libraryname > found" or something like that. The problem seems to be something else, and > I'm not sure how to find it. > > Thanks, > Bernardo > > On 14 October 2015 at 16:28, Renato Ma

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
Sorry Bernardo, I just double checked. I use: System.loadLibrary(); Could you also try that? Renato M. 2015-10-14 21:51 GMT+02:00 Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com>: > Hi Bernardo, > > So is this in distributed mode? or single node? Mayb
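A small sketch of the difference between the two calls (library name and path are illustrative):

    // Resolves the bare name against java.library.path / LD_LIBRARY_PATH,
    // so each node can find its own local copy of libmylib.so
    System.loadLibrary("mylib")

    // Needs an absolute path that must exist on every node
    System.load("/opt/native/libmylib.so")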

Re: Running in cluster mode causes native library linking to fail

2015-10-14 Thread Renato Marroquín Mogrovejo
You can also try setting the env variable LD_LIBRARY_PATH to point to where your compiled libraries are. Renato M. 2015-10-14 21:07 GMT+02:00 Bernardo Vecchia Stein : > Hi Deenar, > > Yes, the native library is installed on all machines of the cluster. I > tried a
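A minimal sketch of wiring that through the Spark configuration rather than the shell environment (the /opt/native path is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executorEnv.LD_LIBRARY_PATH", "/opt/native")  // env var seen by executor JVMs
      .set("spark.executor.extraLibraryPath", "/opt/native")    // prepended to the executors' library path
      .set("spark.driver.extraLibraryPath", "/opt/native")      // same for the driver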

Doubts about SparkSQL

2015-05-23 Thread Renato Marroquín Mogrovejo
Hi all, I have some doubts about the latest SparkSQL. 1. The SparkSQL paper states that "The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into one Spark map operation." ... If dealing with a query of the form:
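As a side note, the plan the physical planner produces for a projection-plus-filter query can be inspected directly (table and column names here are illustrative):

    val df = sqlContext.sql("SELECT attribute1 FROM tableX WHERE attribute1 BETWEEN 0 AND 5")
    df.explain(true)   // prints the logical, optimized, and physical plans; pipelined operators run within one stage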

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
using rows directly: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#programmatically-specifying-the-schema Avro or parquet input would likely give you the best performance. On Tue, Apr 21, 2015 at 4:28 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Thanks
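A minimal sketch of the programmatic-schema route from that guide (field names, types, and the input path are illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("attribute1", IntegerType, nullable = false),
      StructField("attribute2", StringType, nullable = true)))

    val rowRDD = sc.textFile("/path/to/input.csv")
      .map(_.split(","))
      .map(fields => Row(fields(0).toInt, fields(1)))

    val df = sqlContext.createDataFrame(rowRDD, schema)
    df.registerTempTable("tableX")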

Re: SparkSQL performance

2015-04-21 Thread Renato Marroquín Mogrovejo
, as you are not using a filter in SQL side. Best Ayan On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Does anybody have an idea? a clue? a hint? Thanks! Renato M. 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com: Hi

Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Does anybody have an idea? A clue? A hint? Thanks! Renato M. 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com: Hi all, I have a simple query, "Select * from tableX where attribute1 between 0 and 5", that I run over a Kryo file with four partitions that ends up

SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Hi all, I have a simple query, "Select * from tableX where attribute1 between 0 and 5", that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case. If I run this query by doing a simple map().filter() it takes around 9.6 seconds, but when I apply a schema,
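A minimal sketch of the two code paths being timed (the case class, input path, and objectFile reader are illustrative stand-ins for the Kryo-serialized input):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(attribute1: Int, attribute2: String)

    val sc = new SparkContext(new SparkConf().setAppName("between-query"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val records = sc.objectFile[Record]("/path/to/partitions")   // stand-in for the Kryo input

    // Variant 1: plain map/filter over the RDD
    val viaRdd = records.filter(r => r.attribute1 >= 0 && r.attribute1 <= 5).count()

    // Variant 2: apply a schema and run the same predicate through SQL
    records.toDF().registerTempTable("tableX")
    val viaSql = sqlContext.sql("SELECT * FROM tableX WHERE attribute1 BETWEEN 0 AND 5").count()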

Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
Hi all, I am trying to understand how Spark's lazy evaluation works, and I need some help. I have noticed that creating an RDD once and using it many times won't trigger recomputation of it every time it gets used, whereas creating a new RDD every time a new operation is performed will trigger
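A minimal sketch of the behaviour being described, with an explicit cache() on the reused RDD (the input path is illustrative):

    val lengths = sc.textFile("/path/to/input").map(_.length)   // lazy: nothing runs yet

    lengths.cache()                  // mark for reuse; materialized by the first action
    val total = lengths.sum()        // triggers the computation and fills the cache
    val count = lengths.count()      // served from the cached partitions

    // Without cache(), each action re-runs the whole lineage from the input file,
    // and building a fresh RDD for every operation always recomputes from scratch.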

Re: Spark caching

2015-03-30 Thread Renato Marroquín Mogrovejo
.) On Mon, Mar 30, 2015 at 9:43 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi all, I am trying to understand Spark lazy evaluation works, and I need some help. I have noticed that creating an RDD once and using it many times won't trigger recomputation of it every time

[Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Renato Marroquín Mogrovejo
Hi Spark experts, Is there a way to convert a JavaSchemaRDD (for instance, one loaded from a parquet file) back to a JavaRDD of a given case class? I read on StackOverflow [1] that I could do a select over the parquet file and then get the fields out by reflection, but I guess that would be an
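A minimal sketch of the row-mapping approach suggested there, shown with the Scala API (the Java version is analogous; the Person fields and parquet path are illustrative):

    case class Person(name: String, age: Int)

    val schemaRDD = sqlContext.parquetFile("/path/to/people.parquet")
    // Map each generic Row back into the case class by position.
    val people = schemaRDD.map(row => Person(row.getString(0), row.getInt(1)))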